Thursday, July 18, 2013

Ideas for improving remote free for StreamFlow allocator

StreamFlow is a memory allocator that uses a simple idea to deal with producer-consumer threads, where an object is allocated from the producer thread, sent to the consumer, and subsequently freed. The object is "remote freed" back to the original heap using an atomic list insertion involving just a compare and swap. The remote free list can be vacated by the producer again using an atomic exchange. There is no need for complex lock-free wait-free algorithm.

But the remote-free lists themselves can become a performance bottleneck, as each remote free causes exclusive cache line to be migrated to the CPU performing the free. A pathological case is when any of the \(N\) threads could free to any other threads, causing cache line ping-pong and memory bus contention.

An idea is to cache an object to be remote-freed in a local free list first. For object sizes smaller than a cache line, reusing this local free list for allocation could cause false sharing, so this local free list is only for later remote free. Once this list reaches a certain length (say 1024), the objects in the list are sorted by the heap into batches, and then each batch is remote-freed in a single compare and swap.

Here is how the sorting would work. The sorter would allocate an array of 1024 pointers on stack, copy the object addresses to this array, and run in-place quicksort. Copying to the array might incur the most cache miss, but subsequent sorting would benefit by \(O(lgn)\) of cache hits. This idea came from What's the fastest algorithm for sorting a linked list? on StackExchange.

For object sizes greater than a cache line, false sharing occurs rarely, so the local free list could be used to satisfy allocation request first in the fashion of tcmalloc. It would be cleared using batched remote free if the list overflows.