Yesterday, I was looking into whether OpenBSD supports ZFS (it may have briefly, but the code is no longer there). As I perused Ted Unangst's blog, I found a benchmark showing that creating a socket is 10x slower on Linux than on OpenBSD, referencing a post by Jann Horn. People quickly pinned the blame on RCU, in particular synchronize_rcu().
Read-Copy Update (see also What is RCU, Fundamentally?) is a way to share data structures across multiple CPUs that favors readers by making the writer do more work. Readers can assume their copy is immutable within the scope of a critical section, so they can read it without blocking. The writer first makes a private copy, updates it, then publishes the new copy to a global reference. However, the writer has to wait for all readers to be done with the old copy before it can be reclaimed, and it waits for them by calling synchronize_rcu().
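Here is a minimal user-space sketch of the pattern in C++. The crude global reader counter stands in for the kernel's grace-period machinery, all the names are mine, and the default seq_cst atomics are deliberately heavy-handed; real RCU readers pay essentially nothing:

```cpp
#include <atomic>
#include <thread>

struct Config { int value; };

std::atomic<Config*> g_config{new Config{1}};
std::atomic<int> g_readers{0};  // crude stand-in for grace-period tracking

// Reader side: a non-blocking critical section, in the spirit of
// rcu_read_lock() / rcu_read_unlock().
int read_config() {
    g_readers.fetch_add(1);
    Config* c = g_config.load();   // this snapshot is immutable to us
    int v = c->value;
    g_readers.fetch_sub(1);
    return v;
}

// Writer side (single writer assumed): copy, update, publish, then wait
// before reclaiming, in the spirit of synchronize_rcu().
void update_config(int new_value) {
    Config* old = g_config.load();
    Config* fresh = new Config{new_value};  // private copy with the update
    g_config.store(fresh);                  // publish the new copy
    while (g_readers.load() != 0)           // wait for *all* readers,
        std::this_thread::yield();          // even ones on the fresh copy
    delete old;                             // now safe to reclaim
}
```

Note how conservative the wait is: the writer stalls until the reader count hits zero, even though some of those readers may already be looking at the fresh copy.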
The synchronize_rcu() call is probably what actually makes writes slow. It waits for all readers to exit the critical section covering the old data, then frees it, before the next update can proceed. Worse, an inconsiderate reader might treat the data as a private copy and hold the critical section open while blocked on I/O or something else. RCU conservatively prevents old data from accumulating at all, when in reality multiple stale copies would have been fine as long as they don't grow without bound. This is the main idea behind GUS (Global Unbounded Sequences); the unboundedness refers to the writer being unrestrained by readers lagging behind. It is implemented as Safe Memory Reclamation (SMR) in FreeBSD (subr_smr.c), which uses a global sequence number to determine whether each reader has moved past a retired version.
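Here is a sketch of the sequence-number idea, loosely in the spirit of FreeBSD's SMR; the names are illustrative, memory ordering is simplified to the seq_cst defaults, and the real subr_smr.c is considerably more careful:

```cpp
#include <array>
#include <atomic>
#include <cstdint>
#include <functional>
#include <limits>
#include <vector>

constexpr int kMaxThreads = 8;
std::atomic<uint64_t> g_wr_seq{1};                         // global write sequence
std::array<std::atomic<uint64_t>, kMaxThreads> g_rd_seq{}; // 0 = not reading

void smr_enter(int tid) {
    // Reader advertises the sequence it observed; it never blocks the writer.
    g_rd_seq[tid].store(g_wr_seq.load());
}
void smr_leave(int tid) { g_rd_seq[tid].store(0); }

struct Retired { std::function<void()> free_fn; uint64_t seq; };
std::vector<Retired> g_limbo;  // writer-private list of not-yet-freed versions

void smr_retire(std::function<void()> free_fn) {
    // Caller must have unlinked the object first, so no new reader can reach
    // it. Record it under the current sequence, advance the sequence, and
    // move on immediately: limbo can grow while readers lag ("unbounded").
    g_limbo.push_back({std::move(free_fn), g_wr_seq.fetch_add(1)});
}

void smr_poll() {
    // Free everything retired before the oldest sequence any reader holds.
    uint64_t oldest = std::numeric_limits<uint64_t>::max();
    for (auto& r : g_rd_seq) {
        uint64_t s = r.load();
        if (s != 0 && s < oldest) oldest = s;
    }
    std::erase_if(g_limbo, [&](Retired& r) {
        if (r.seq < oldest) { r.free_fn(); return true; }
        return false;
    });
}
```

The writer never waits: retired versions merely pile up in limbo until the slowest reader's sequence advances past them.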
It would have been entirely reasonable to just tell the last reader that it is responsible for freeing the data (i.e. reference counting), so the writer can move on. Or perhaps the writer could supply a deleter function that runs when the last reader exits the critical section. Either way, old data can only accumulate as long as there are readers holding it, so it is bounded by the number of readers.
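A sketch of that last-reader-frees idea: to sidestep the classic race between loading the pointer and bumping its count, the acquire path here holds a tiny lock; production designs use split reference counts or similar tricks, and all names are mine:

```cpp
#include <atomic>
#include <mutex>

struct Snapshot {
    std::atomic<int> refs;
    int value;
};

std::mutex g_acquire_lock;                // guards load+increment and the swap
Snapshot* g_snap = new Snapshot{{1}, 1};  // the published reference counts as 1

Snapshot* acquire() {
    std::lock_guard<std::mutex> g(g_acquire_lock);
    g_snap->refs.fetch_add(1);
    return g_snap;
}

void release(Snapshot* s) {
    if (s->refs.fetch_sub(1) == 1)  // last reference gone: this reader frees it
        delete s;
}

void publish(int new_value) {
    Snapshot* fresh = new Snapshot{{1}, new_value};
    Snapshot* old;
    {
        std::lock_guard<std::mutex> g(g_acquire_lock);
        old = g_snap;
        g_snap = fresh;
    }
    release(old);  // drop the published reference; the writer never waits
}
```

An old snapshot now lives exactly as long as its slowest reader, so at most one stale copy per reader can be outstanding.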
Hazard Pointers try to solve a similar problem, but for the narrower purpose of memory reclamation rather than ensuring a consistent view of a data structure. A reader enters the critical section by adding the object in question to a global registry of hazard pointers. Whoever wants to free an object must scan the registry to make sure no reader still holds it. The idea was proposed by Maged Michael in 2002, then seemingly lay dormant for over 20 years, but is now making its way into C++26.
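A minimal sketch with one hazard slot per thread; the C++26 interface and the original paper are more general, and a real implementation would defer a still-hazardous object to a retired list rather than simply declining to free it:

```cpp
#include <array>
#include <atomic>

constexpr int kMaxThreads = 8;
std::array<std::atomic<void*>, kMaxThreads> g_hazard{};  // nullptr = no hazard

std::atomic<int*> g_obj{new int{42}};

// Reader: publish the pointer as hazardous, then re-check that it is still
// the current one; retry if the writer swapped it out in between.
int* protect(int tid) {
    int* p;
    do {
        p = g_obj.load();
        g_hazard[tid].store(p);
    } while (p != g_obj.load());
    return p;
}

void clear(int tid) { g_hazard[tid].store(nullptr); }

// Reclaimer: scan the whole registry; free only if no slot advertises the
// object. (A real implementation would defer instead of leaking here.)
void retire(int* old) {
    for (auto& h : g_hazard)
        if (h.load() == old) return;
    delete old;
}
```

The cost structure is the inverse of RCU's: every reader does a little extra work per access (publish and re-check), while the reclaimer must scan every slot in the registry.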
But I argue that hazard pointers are not scalable. Even if we allow at most one hazard pointer per thread, we still have to support millions of threads in a production system, all holding a hazard pointer one way or another, so the sheer size of the registry is phenomenal. The registry also raises the question of its own internal consistency: by the time a writer finishes scanning the registry to see whether any hazard pointer implicates an object, another reader might have added a hazard pointer for it without the scanner knowing. (The protocol's answer is to unlink the object first, so no new reader can reach it, and to have readers re-check the pointer after publishing the hazard, as in the retry loop above; but that only adds to the per-access cost.)
When it comes to data structures, especially high-performance ones, treating the data as immutable is the way to go. Modern multicore CPUs use MESI-family cache-coherence protocols (MOESI on AMD, for example), so each core can keep a read-only copy of the line in its local cache and access it quickly without going to main memory; since nobody writes to the line, there are no invalidations and very little memory-bus contention.
The problem with RCU is not the immutable paradigm but how it makes the writer wait for all readers. Also, even if entering the critical section is non-blocking, a reader should exit as soon as it can rather than linger while potentially blocked on something else. And we should simply make the last reader free the object upon exiting, so versioning is not necessary.
In some extreme cases, where a large amount of immutable data is tied to worker threads, it might even make sense to wind down the workers (i.e. lameduck them) and spin up new workers with the new data, similar to deploying a new version of a microservice.
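A sketch of that pattern with std::jthread; the names and the drain-on-stop behavior are illustrative:

```cpp
#include <memory>
#include <stop_token>
#include <thread>
#include <vector>

// Workers are born with an immutable snapshot and never touch shared state
// again; an update starts a new generation and asks the old one to drain.
struct Data { int version; /* ... large immutable payload ... */ };

void worker(std::stop_token stop, std::shared_ptr<const Data> data) {
    while (!stop.stop_requested()) {
        // Serve requests against *data; no synchronization needed.
        std::this_thread::yield();
    }
    // Lameduck: finish in-flight work, then exit. The shared_ptr frees the
    // old snapshot once the last worker of the old generation is gone.
}

std::vector<std::jthread> g_workers;

void deploy(std::shared_ptr<const Data> fresh, int n) {
    std::vector<std::jthread> next;
    for (int i = 0; i < n; ++i)
        next.emplace_back(worker, fresh);
    g_workers.swap(next);
    // `next` now holds the old generation; its destructor requests stop on
    // each jthread and joins, winding the old workers down.
}
```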