Saturday, June 5, 2010

Parallel computing and the cost of TLB shoot-down

The translation lookaside buffer (TLB) is part of the CPU's virtual memory facility: it caches the mapping of virtual memory pages to physical memory pages, which speeds up the translation of virtual addresses to physical addresses. Without a TLB, every virtual memory access would require several additional physical memory accesses to look up the page table. The TLB makes virtual memory practical, and it is one of the earliest forms of memory caching. Virtual memory gives the illusion that the system has more memory than is actually installed. The operating system automatically arbitrates the use of main memory, which relieves the programmer of the burden of overlay planning and is said to dramatically increase programmer productivity. The way virtual memory works is described in detail in The Locality Principle.
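To make the lookup cost concrete, here is a minimal sketch in C of how a translation might proceed, assuming a hypothetical direct-mapped 64-entry TLB and a toy single-level page table (a real x86-64 walk traverses four levels, so a miss costs several dependent memory accesses, not the one shown here):

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT 12
    #define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

    typedef struct { uint64_t vpn, pfn; bool valid; } tlb_entry_t;

    static tlb_entry_t tlb[64];       /* hypothetical direct-mapped TLB */

    /* Toy page table covering only the first 256 virtual pages; a real
     * table is built by the OS and spans the whole address space. */
    static uint64_t page_table[256];

    uint64_t translate(uint64_t vaddr)   /* assumes vpn < 256 */
    {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        tlb_entry_t *e = &tlb[vpn % 64];

        if (!e->valid || e->vpn != vpn) { /* TLB miss: walk the table  */
            e->vpn = vpn;
            e->pfn = page_table[vpn];     /* extra memory access here  */
            e->valid = true;
        }
        return (e->pfn << PAGE_SHIFT) | (vaddr & PAGE_MASK);
    }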

In the very beginning, each process running on the operating system corresponded to one virtualized computer: it had exactly one CPU, and it ran in an isolated address space. On a shared-memory multi-processor system, each processor runs one process. Processes only need to synchronize (wait for each other) when one of them causes the operating system to acquire or release physical memory, and even then, only the processes acquiring or releasing memory need to be synchronized. All other processes can run uninterrupted and in parallel.

When the operating system introduces multi-threading support, things get complicated. Originally, multi-threading allowed concurrency in an I/O-bound program, relieving the programmer of the burden of multiplexing the control flow over I/O operations that happen concurrently. However, multi-threading also allows a CPU-bound process, which has multiple threads of execution, to run across several processors, all of which share the same address space. In this case, each processor has its own virtual memory unit, in particular its own TLB, which has to be synchronized whenever the address space mapping changes.

The synchronization generally only needs to happen when an existing mapping is modified or removed. When a new mapping is created, the TLB entry for it is guaranteed not to exist yet, so no synchronization is needed. If the TLBs were not synchronized, other threads could still access the old mapping, whose physical page might already have been reused for another purpose. The synchronization uses a technique called TLB shoot-down.
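As a concrete illustration, here is a small POSIX sketch of the three cases: the fresh mmap cannot have a stale TLB entry, while mprotect and munmap change or remove an existing mapping and, in a multi-threaded process, force a shoot-down:

    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        /* New mapping: no prior TLB entry can exist for this range,
         * so no shoot-down is needed. */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        p[0] = 42;              /* faults the page in, fills the TLB */

        /* Modifying an existing mapping: stale entries on every CPU
         * that may cache this translation must be shot down. */
        mprotect(p, 4096, PROT_READ);

        /* Unmapping likewise requires a shoot-down before the
         * physical page can be safely reused. */
        munmap(p, 4096);
        return 0;
    }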

When a mapping change happens, the initiating processor performs the change, flags the shoot-down request at a shared memory location, and then triggers an Inter-Processor Interrupt (IPI) on the other processors. When those processors receive the interrupt, they flush their TLBs and acknowledge that the request has been completed. The initiating processor has to wait until all processors acknowledge completion.
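Here is a simplified sketch of that protocol in C. The primitives send_ipi and flush_local_tlb, the pending_acks counter, and the fixed CPU count are hypothetical stand-ins for illustration, not any particular kernel's API:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NCPU 8

    extern void send_ipi(int cpu);     /* raise the TLB-flush interrupt */
    extern void flush_local_tlb(void); /* e.g. invlpg or a full flush   */

    static atomic_int pending_acks;    /* shared shoot-down state       */

    /* Initiator: runs on the CPU that changed the page tables. */
    void tlb_shootdown(int self, bool cpu_uses_mm[NCPU])
    {
        int others = 0;
        for (int cpu = 0; cpu < NCPU; cpu++)
            if (cpu != self && cpu_uses_mm[cpu])
                others++;

        atomic_store(&pending_acks, others);
        for (int cpu = 0; cpu < NCPU; cpu++)
            if (cpu != self && cpu_uses_mm[cpu])
                send_ipi(cpu);

        flush_local_tlb();             /* invalidate our own entry too  */

        /* Wait until every targeted CPU has flushed and acknowledged. */
        while (atomic_load(&pending_acks) > 0)
            ;                          /* the initiator stalls here     */
    }

    /* Responder: the IPI handler on each targeted CPU. */
    void tlb_flush_ipi_handler(void)
    {
        flush_local_tlb();
        atomic_fetch_sub(&pending_acks, 1);  /* acknowledge completion  */
    }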

One can imagine why TLB shoot-down is costly. The most direct consequence is that the TLB has to be rebuilt, which is only cost-effective if that cost is amortized over time. Frequent TLB shoot-downs will cause virtual memory to perform poorly.

Furthermore, it is not only the processor that makes the change that has to synchronize; all the processors sharing the address space are affected, regardless of whether they are interested in the change. CPU-bound processes that want to scale typically have each thread work on a private memory area within the shared address space, as the sketch below illustrates. This takes advantage of the per-CPU cache for local data, so that bus traffic is reduced; bus traffic is the bottleneck of scalability in any parallel computer. However, TLB shoot-down slows down these other threads even if they only ever touch their own private memory areas. The cost of TLB shoot-down increases as the number of CPUs increases. This has been studied extensively and serves as a motivation for Barrelfish, a multi-kernel operating system.
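The pattern looks roughly like this pthreads sketch, where each thread works on a disjoint slice of one shared allocation; the slice size and the arithmetic in the loop are made up for illustration:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 4
    #define CHUNK (1 << 20)

    static double *shared;        /* one shared address space...        */

    static void *worker(void *arg)
    {
        long id = (long)arg;
        double *mine = shared + id * CHUNK;  /* ...a private slice each */

        /* Each thread touches only its own slice, so its working set
         * stays in its own CPU cache and no bus traffic is needed for
         * sharing. A TLB shoot-down elsewhere in the address space
         * still interrupts this thread, even though it never touches
         * the affected pages. */
        for (long i = 0; i < CHUNK; i++)
            mine[i] = mine[i] * 2.0 + 1.0;
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        shared = calloc(NTHREADS * (size_t)CHUNK, sizeof *shared);
        if (!shared) return 1;

        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);

        printf("%f\n", shared[0]);
        free(shared);
        return 0;
    }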

It looks like if you are writing a shared-memory parallel program that is CPU-bound and want it to scale to a large number of processors, you should reduce the number of unmap operations. Of course, the alternative is to break the threads apart and revert to several address-isolated processes that communicate through well-defined channels such as MPI, which also works well on distributed systems.
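One way to reduce unmap operations is to recycle mapped memory instead of returning it to the kernel. The following is a minimal, non-thread-safe sketch of a free-list pool (a real allocator would need a lock or per-thread lists); the names pool_alloc and pool_free are made up for this example:

    #include <sys/mman.h>
    #include <stddef.h>

    /* Instead of returning pages to the kernel with munmap (forcing a
     * TLB shoot-down on every CPU sharing the address space), freed
     * blocks stay mapped and are recycled through a free list. */
    struct block { struct block *next; };

    static struct block *free_list;
    #define BLOCK_SIZE 4096

    void *pool_alloc(void)
    {
        if (free_list) {                  /* reuse: no mapping change   */
            void *p = free_list;
            free_list = free_list->next;
            return p;
        }
        /* New mapping: no stale TLB entries exist, no shoot-down. */
        void *p = mmap(NULL, BLOCK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    void pool_free(void *p)
    {
        struct block *b = p;              /* keep it mapped, never unmap */
        b->next = free_list;
        free_list = b;
    }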
