Saturday, August 3, 2013

Eliminate Cache Coherence from HyperTransport

Processors are equipped with caches to speed up memory access, because main memory speed has failed to keep up with processor speed. On a symmetric multi-processor system, each processor has a private cache, but all of them share a memory bus. A cache coherence protocol ensures that the private caches maintain a consistent view of main memory even though any processor may write to main memory at any time.
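
The cost of keeping caches coherent is visible from software. Here is a minimal sketch of false sharing, assuming a 64-byte cache line and Linux with pthreads (compile with -pthread): two threads increment counters that share one cache line, so the coherence protocol keeps bouncing that line between the private caches, and this typically runs noticeably slower than the padded layout where each counter sits on its own line.

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

/* Two counters packed onto one cache line vs. pushed apart by padding
 * (64-byte line size assumed). */
struct shared { volatile unsigned long a, b; };
struct padded { volatile unsigned long a; char pad[64]; volatile unsigned long b; };

static struct shared s;
static struct padded p;

/* Each thread hammers on its own counter. */
static void *inc(void *arg)
{
    volatile unsigned long *c = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        (*c)++;
    return NULL;
}

/* Run two threads on two counters and time the whole thing. */
static double run(volatile unsigned long *a, volatile unsigned long *b)
{
    struct timespec t0, t1;
    pthread_t t[2];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&t[0], NULL, inc, (void *)a);
    pthread_create(&t[1], NULL, inc, (void *)b);
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void)
{
    printf("same cache line:      %.2f s\n", run(&s.a, &s.b));
    printf("separate cache lines: %.2f s\n", run(&p.a, &p.b));
    return 0;
}
```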

Things get more complicated with NUMA, where each processor comes with its own memory controller and its own local main memory, while still keeping a private cache. Processors no longer share a bus; instead they send messages to one another over point-to-point links, and a message may be routed through multiple hops in a network of processors. This is how HyperTransport works. Furthermore, a socket can actually house two multi-core dies, so a quad-socket system can have quite an interesting topology.
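
The topology is visible from software, too. As a rough sketch, assuming Linux with libnuma (compile with -lnuma), numa_distance() reports the relative access cost between nodes; the local node reports 10, and extra HyperTransport hops show up as larger values.

```c
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    int nodes = numa_max_node() + 1;
    printf("%d NUMA nodes\n", nodes);
    /* Print the node-to-node distance matrix: 10 means local,
     * larger values mean more hops to reach the remote memory. */
    for (int from = 0; from < nodes; from++) {
        for (int to = 0; to < nodes; to++)
            printf("%4d", numa_distance(from, to));
        printf("\n");
    }
    return 0;
}
```

On a quad-socket system with two dies per socket, the matrix would have eight rows, with several distinct distance values corresponding to the different hop counts.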

In order to create the illusion of shared memory, a processor needs to talk to another processor to access remote main memory. The more hops between processors, the higher the latency. But HyperTransport 3.1 operates at 3.2GHz, so we're still talking about just 1 or 2 CPU cycles per hop, while memory access time clocks in at 20 cycles or more.
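
Those numbers can be sanity-checked with a small measurement sketch rather than a rigorous benchmark, assuming Linux with libnuma (-lnuma) and at least two NUMA nodes: pin a thread near node 0, then chase pointers through a buffer allocated first on the local node and then on a remote one. The buffer size and stride are arbitrary choices.

```c
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SLOTS (1L << 22)   /* 4M pointers, enough to defeat the caches */

/* Build a pointer chain on the given node: a ring with a large odd
 * stride so the hardware prefetcher cannot hide the latency. */
static void **make_chain(int node)
{
    void **buf = numa_alloc_onnode(SLOTS * sizeof(void *), node);
    for (long i = 0; i < SLOTS; i++)
        buf[i] = &buf[(i + 4099) % SLOTS];
    return buf;
}

/* Walk the chain once; every load depends on the previous one. */
static double chase(void **buf)
{
    struct timespec t0, t1;
    void **p = &buf[0];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < SLOTS; i++)
        p = (void **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    if (p == NULL)   /* keep the compiler from discarding the chain */
        abort();
    return (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need at least two NUMA nodes\n");
        return 1;
    }
    numa_run_on_node(0);   /* pin ourselves next to node 0's memory */
    void **local = make_chain(0);
    void **remote = make_chain(1);
    printf("local:  %.1f ns/access\n", chase(local) / SLOTS);
    printf("remote: %.1f ns/access\n", chase(remote) / SLOTS);
    numa_free(local, SLOTS * sizeof(void *));
    numa_free(remote, SLOTS * sizeof(void *));
    return 0;
}
```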

Current-generation AMD processors cache both local and remote main memory, and cache coherence over HyperTransport is a lot more complicated. If a processor wishes to conduct a transaction that invalidates the caches of other processors, it must send a message to all processors and wait for all of them to respond. This creates a torrent of messages over the HyperTransport links.

A recent feature called HT Assist uses part of the L3 cache as a directory that records which processors' caches need to be invalidated. This reduces the number of messages that need to be sent, increasing effective memory bandwidth.

Having said all this, here is my epiphany: what if we prohibit a processor from caching remote memory? It would cache only its local memory, but it would answer memory accesses from remote processors out of that cache. This could work well because accessing a remote cache over HyperTransport is still faster than accessing local memory, and we could do without the overhead of the cache coherence protocol.

Furthermore, as more programs become NUMA-aware, their structure will adapt so that they primarily access memory local to the processor and use remote memory access as a means of communication. Their performance would not be hindered by not caching remote memory, and it would free up precious HyperTransport bandwidth for useful communication.
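
As a minimal sketch of that structure, with assumed names throughout: each worker below pins itself to a node, does its heavy work in node-local memory, and writes only a small result into a mailbox that lives on the other node, so remote access is used purely for communication. It assumes Linux with libnuma (-lnuma), pthreads, and at least two NUMA nodes; a real program would use C11 atomics rather than volatile for the handshake.

```c
#include <numa.h>
#include <pthread.h>
#include <stdio.h>

#define N (1L << 20)

/* One mailbox per node; each worker writes its result into the *other*
 * node's mailbox, so the only remote traffic is the handover itself. */
struct mailbox { volatile long result; volatile int ready; };
static struct mailbox *mbox[2];

static void *worker(void *arg)
{
    int node = (int)(long)arg;
    numa_run_on_node(node);                       /* run next to our own memory */
    long *data = numa_alloc_onnode(N * sizeof(long), node);
    long sum = 0;
    for (long i = 0; i < N; i++) {                /* all the heavy work touches */
        data[i] = i;                              /* node-local memory only     */
        sum += data[i];
    }
    struct mailbox *peer = mbox[1 - node];
    peer->result = sum;                           /* one small remote write     */
    peer->ready = 1;                              /* (a real program would use  */
    while (!mbox[node]->ready)                    /*  C11 atomics here)         */
        ;
    printf("node %d received %ld from its peer\n", node, mbox[node]->result);
    numa_free(data, N * sizeof(long));
    return NULL;
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need at least two NUMA nodes\n");
        return 1;
    }
    for (int i = 0; i < 2; i++)
        mbox[i] = numa_alloc_onnode(sizeof(struct mailbox), i);
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < 2; i++)
        numa_free(mbox[i], sizeof(struct mailbox));
    return 0;
}
```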

2 comments:

jgoo052@gmail.com said...

Well, it's not quite that simple. It's true that it may be faster to get something out of some other cache than to fetch it from memory. But that's only true if the data is in the cache, of course. And if you need it and don't have it, why would you expect it to magically be in the cache of the processor you're probing?

Not a bad idea -- Cray has been doing this for years. But it's a lot more complicated than you suggest, and probably won't work for unsophisticated code.

Likai Liu said...

If the local processor wants a block that results in a cache miss on the remote processor, then the overhead of the remote processor loading it into its cache would still be greater than the HyperTransport overhead, and subsequent accesses would be cache hits served from the remote cache. I'm not arguing that it will magically show up in the local cache.

The actual problematic case arises when a local processor insists on accessing remote (cached) memory, which results in HyperTransport traffic. I'm arguing that programs written to be NUMA-aware would not do that. Even with remote memory cached locally, it is far too easy to write a program that causes unwanted cache thrashing between processors regardless of the cache coherence protocol, so I'm really arguing that cache coherence is a wasted effort.

Of course, you are right that real workloads are complicated, and all ideas need to be verified experimentally. But it also boils down to this question: do you optimize a CPU to run a particular program faster, or do you design a CPU with clear guidelines for the programmer on how to write optimal programs for it? I firmly believe in the latter, whereas cache coherence seems to be a tool for those who believe in the former.