Recently, news has been reverberating across the interwebs that a tiny Linux kernel tweak could cut datacenter power use by 30%. The claim is based on this code commit, detailed in the paper Kernel vs. User-Level Networking: Don't Throw Out the Stack with the Interrupts. The main idea is to mix polling and interrupt-driven I/O to get the best of both worlds.
Polling is just busy-looping while waiting for something to happen. Each check takes only a few CPU cycles, so the turnaround is very efficient. But if we keep looping and nothing happens, all of that work is wasted.
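To make that concrete, here is a minimal userspace sketch of busy polling on a non-blocking socket. This is purely illustrative and not code from the paper; the function name spin_recv is mine.

```c
/* Illustrative sketch of busy polling: keep retrying a non-blocking
 * read until data shows up (or a real error occurs). */
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

ssize_t spin_recv(int fd, void *buf, size_t len) {
    for (;;) {
        ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);
        if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
            return n;   /* got data, EOF, or a genuine error */
        /* nothing yet: each check costs only a few cycles, so loop again */
    }
}
```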
Interrupt-driven I/O instead tells the CPU how to handle an event when one arrives, then lets the CPU go do something else, or go to sleep if there is nothing else to do. When the interrupt fires, the CPU has to save the context of the current program and switch contexts, so the per-event overhead is greater. But it is more efficient when there are long waits between events.
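The userspace analogue of that style is a blocking wait: the thread sleeps inside the kernel and costs no CPU while idle, but every wakeup pays for the interrupt and the context switch. A minimal sketch, again with a made-up function name:

```c
/* Illustrative sketch of interrupt-driven waiting as seen from userspace:
 * block in poll() and let the kernel deschedule the thread until the
 * NIC interrupt marks the socket readable. */
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

ssize_t sleep_recv(int fd, void *buf, size_t len) {
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    if (poll(&pfd, 1, -1) < 0)   /* -1 timeout: sleep until data arrives */
        return -1;
    return recv(fd, buf, len, 0);
}
```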
When events are separated by idle gaps but tend to arrive in bursts (e.g. a Poisson process with exponentially distributed inter-arrival times), it makes sense to wait for the first event using an interrupt, then process subsequent events using polling. We can further cap the time spent polling at the cost we would have paid for the interrupt, which makes it a win-win: if an event arrives during the polling window, we save the cost of the interrupt minus the cost of polling, and if it does not, we fall back to the interrupt and are never worse off.
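Here is a hedged sketch of that hybrid, combining the two snippets above. This is my reading of the idea, not the paper's actual kernel patch, and SPIN_BUDGET_NS is a made-up tuning knob standing in for the sleep/wakeup cost.

```c
/* Hybrid sketch (illustrative): busy-poll for a bounded budget roughly
 * equal to the cost of going to sleep and being woken up, then fall
 * back to a blocking wait. */
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

#define SPIN_BUDGET_NS 20000LL   /* hypothetical: ~ sleep/wakeup cost */

static long long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

ssize_t hybrid_recv(int fd, void *buf, size_t len) {
    long long deadline = now_ns() + SPIN_BUDGET_NS;
    do {                                   /* phase 1: bounded busy polling */
        ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);
        if (n >= 0 || (errno != EAGAIN && errno != EWOULDBLOCK))
            return n;
    } while (now_ns() < deadline);

    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    if (poll(&pfd, 1, -1) < 0)             /* phase 2: sleep until woken */
        return -1;
    return recv(fd, buf, len, 0);
}
```

If data shows up during the spin, we handle it immediately and skip the sleep and wakeup entirely; if not, we have burned at most roughly what the sleep would have cost us anyway.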
This idea is in fact not new. The technique is often used in task schedulers; I remember seeing it in an old MIT Cilk version circa 1998, and I have seen it in some mutex implementations that spin briefly before blocking. I also used the same technique in my Ph.D. work circa 2014 (in the artifact Liu_bu_0017E_429/libikai++-dissertation.tbz2, see parallel/parallel.cc and the deep_yield() function in particular). The technique seems so trivial that I have never seen anyone bother to mention it.
So will this technique cut datacenter power use by 30%? The paper measured "queries per second" improvements in a memcached microbenchmark, but it is naive to assume that a 30% increase in throughput automatically translates into a 30% reduction in power use. That would only hold if all a datacenter did was receive network packets, with no additional work such as cryptography, marshaling and unmarshaling of the wire protocol, local memory and storage I/O, calls to external services, and computing a response. The 30% figure is an upper bound in the most ideal scenario, and that is the number that caught the attention of the Internet.
It is a welcome optimization nonetheless, and the authors have done the hard work of validating their claims with benchmarks and a thorough explanation of how it works, so they deserve credit for the execution. Perhaps someone could go on to audit whether other OS subsystems (e.g. block devices, the CPU scheduler) could also benefit from this treatment, in Linux and other OSes.