Processors are equipped with a cache to speed up memory access because main memory access speed failed to keep up with processor speed. On a symmetric multi-processor system, each processor possesses a private cache but all share a memory bus. Cache coherence protocol makes sure that all the private caches maintain a consistent copy of a portion of the main memory even though any processor may write to the main memory any time.
Things get complicated with NUMA where processors now come with their own memory controller and their local main memory. They still have their private cache. Processors no longer share a bus, but they send messages to one another through a point-to-point communication channel. A message may be routed through multiple hops in a network of processors. This is how HyperTransport works. Furthermore, a socket can actually house two multi-core dies, and on a quad socket system it can create quite an interesting topology.
In order to create the illusion of shared memory, a processor will need to talk to another processor for accessing remote main memory. The longer the hops between processors, the higher the latency. But HyperTransport 3.1 operates at 3.2GHz, so we're still talking about just 1 or 2 CPU cycles. Memory access time clocks in at 20 cycles or more.
Current generation AMD processor caches both local as well as remote main memory. Cache coherence over HyperTransport is a lot more complicated. If a processor wishes to conduct a transaction and invalidate the caches of other processors, it must send a message to all processors and wait until all of them to respond. This creates a torrent of messages over the HyperTransport link.
A recent feature called HT assist uses parts of the L3 cache to keep a record of which processors' caches need to be invalidated. This reduces the number of messages that need to be sent, increasing effective memory bandwidth.
Having said all this, here is my epiphany: what if we prohibit a processor from caching remote memory? It will only cache local memory, but will answer memory access on behalf of a remote processor from the cache. This would work well because accessing a remote cache over HyperTransport is still faster than accessing local memory. And we can do without the overheads of cache coherence protocol.
Furthermore, as more programs become NUMA aware, their structure would adapt so that they access primarily memory local to the processor, but leverage remote memory access as a mean of communication. Their performance will not be hindered by not caching remote memory, and it frees up the precious bandwidth on HyperTransport to do useful communication.