Saturday, November 30, 2013

To err is human; to out, pipeline.

Many introductory programming classes begin with the hello world example like this:
#include <stdio.h>

int main() {
  printf("hello world!\n");
  return 0;
From this point on, the notion of writing human readable messages from a program to “standard output” is cemented in the student's mind, and a student continues to do so for the rest of his professional career. This is bad because this is not the intended design of computer systems. The intent is that “standard output” (abbreviated as stdout in a program) be machine readable, while “standard error” (abbrev. stderr) is human readable. A corrected version of hello world would be:
#include <stdio.h>

int main() {
  fprintf(stderr, "hello world!\n");
  return 0;
Pipeline is a feature of Unix that allows you to chain the standard output of one program to the standard input of another. Simple programs can be arbitrarily composed to perform complex functions. It is also a feature of many operating systems inspired by Unix, such as MS-DOS, and its successor, Windows. Standard error is used to report errors, so you typically don't pipeline standard error.

As an example, if a text file contains your personal address book, one contact per line with space separated fields like this:
$ cat contacts 
John Smith 555-1234
John Doe 555-2345
Jane Smith 555-1234
Jane Doe 555-2345
Alan Smithee 555-3456
Smith Johnson 555-4567
Then you could list all last names like this:
$ cut -d' ' -f 2 contacts 
Enumerate the Smiths (last name):
$ cut -d' ' -f 2 contacts | grep '^Smith$'
And count the Smiths:
$ cut -d' ' -f 2 contacts | grep '^Smith$' | wc
       2       2      12
You can sort and list the unique last names:
$ cut -d' ' -f 2 contacts | sort -u
You can do similar things to email and phone area code, all using simple programs that don't do very much on their own, but can be combined using pipeline to perform specific queries or text manipulations:
  • cut: extracts the fields of input lines.
  • grep: prints input lines matching a pattern.
  • wc: counts lines, words, and bytes.
  • sort: sorts and optionally prints only unique lines.
This doesn't work if one of the programs in the pipeline would output error message to standard output.

For a beginner who might confuse the roles of standard output and standard error, the following mnemonic should help.
To err is human; to out, pipeline.
Maybe humans do err too much.

Friday, November 29, 2013

Aligned data transfer, or Why Headers Must Come Last

When a program on computer A wants to send data to a program on computer B, the data undergoes a series of nested transformation where, at each stage, a header is prepended to the data. Prepending is costly because data must be copied to a new location in order to make space when inserting, or to eliminate space when removing headers, assuming the alignment of data in the destination is important. Alignment is important because of the way memory system is divided into pages, and the way permanent storage is divided into sectors.

The current protocol stack design relies on prepending so much that any implementation inevitably performs a lot of data copying. Copying data is detrimental to performance because of wasted CPU cycles and memory bandwidth. If headers come after the payload, we can append and remove off headers easily with simple memory management, all without affecting payload alignment.

Header prepending is generally not a problem for the sender because network cards support gathering (i.e. vectored I/O) which incurs no cost for packet assembly, but the receiver cannot in general predict the header size. We can program the network card to scatter the payload to a well-aligned memory address only if we know the header size in advance.

Consider sending data as a typical TCP/IP packet. The data undergoes the following transforms:

  • Optional application-level encryption and integrity checking. This adds frame header to tell apart negotiation from data transmission, and also a cipher header. See TLS/SSL.
  • TCP header is prepended by the operating system to encapsulate the data within a logical stream socket, for establishment of connection, in-order delivery, and flow control.
  • IP header is prepended by the operating system for routing information, i.e. source and destination of the packet. Both IPv4 and IPv6 are in active use.
  • Other host-level encapsulation, e.g. VPN, PPPoE.
  • Depending on the link layer, either the network card or the OS would prepend a link layer header. This allows the communication hardware to identify itself over the medium.
  • Other encapsulation and decapsulation headers (e.g. MPLS, DOCSIS) are not visible to the operating system.
Predicting header size is hard primarily because of the dual-stack and host-level encapsulation. One can argue that because host-level encapsulation typically transmits data over slow and high-latency WAN links, performance is less of a concern. But think about the Internet backbone routers that might also need to encapsulate. Any saving in their CPU cycle and memory bandwidth is an increase in overall Internet performance.

If we append instead of prepend headers (of course header would be a misnomer now), headers can be removed by truncation. We still need gathering so that the operating system does not need to rely on the application to allocate sufficient buffer size to accommodate for all headers. Now the scattering only needs to focus on receiving a jumbo packet across multiple discontinuous memory pages.

Network technology can continue to scale, but with the current header prepending protocol design, this requires the CPU and memory technology to scale 2x or more than the network. This is unrealistic, so we must redesign the network protocols to append headers instead.

Wednesday, November 27, 2013

REPL kernel and CPU scheduler

Today I saw a picture of BeOS and its kernel debugger. It had occurred to me that a debugger is really the read-eval-print-loop (REPL) of the crude assembly or C language. It makes perfect sense why it's a good idea. BASIC, Scheme, Standard ML, Python, and JavaScript are some example languages featuring a REPL. GDB even features a limited REPL that allows you to call an existing function when debugging a C/C++ program. The REPL makes it easier to inspect the program state and try out some modifications of the program while it's running. It's a powerful rapid prototyping tool. Any serious programming environment must support REPL.

Kernel debugger must be in some way tied to the CPU scheduler, since it needs to suspend execution of the kernel and inspect its state in suspension. This leads me to consider in a very abstract sense how to write a CPU scheduler in general.
void scheduler() {
  while (1) {
    task_t *task = next();  // OS sched. alg.
    if (!task) {
      halt();  // CPU specific.
    run(task);  // CPU specific.
    put(task);  // OS sched. alg.
The scheduler is really an event-driven loop using some CPU specific subroutines and OS scheduler algorithm functions.

The CPU specific functions are:
  • halt() makes the CPU idle, possibly enters power-saving mode, until an interrupt occurs.
  • run() context switches into the task and returns when an interrupt occurs.
Interrupt is the primary way to allow the scheduler to regain control. For example, preemptive scheduling can be implemented by setting a timer interrupt to force run() to return when the interrupt is triggered. It could be a periodic or a one-shot interrupt; we don't really care. For hardware that only has one single-shot timer, we can easily write a timer multiplexer.

To make things simple, assume that everything in the kernel and the user processes can be structured as a task. The run() function would have to discern what type of context switch is appropriate for what type of task.

The OS scheduling functions are:
  • next() to fetch the next task in the scheduling queue that is ready to run.
  • put() places the task back into the scheduling queue.
The scheduler itself has no notion of priorities. It is next() that returns the next task to run. But depending on the implementation of the queue, the actual scheduling decision can be made in either next() or put(). Furthermore, we defer the notion of task completion to put(). If put() determines that a task is done, it will dispose the task in some other way than putting it back to the queue.

In a multi-processor system, each CPU will run its own scheduler and have its own task queue. The next() function might attempt to steal task from another CPU's queue if the current queue is empty. The queue takes ownership of a task, but the task can migrate from one queue to another. Work stealing of distributed queues is a good load balancing strategy, and I think it's the only such strategy that is provably scalable.

Within this scheduler, one possibility is to represent the REPL as a task and schedule it like all other kernel tasks. This is the simplest, but the REPL would run simultaneously with all tasks. It cannot be used to debug the scheduler or examine a freeze state of the CPU. One the other hand, having a REPL that can look at the live state of the kernel is pretty cool. Suspending the kernel is quite easy: simply allow uninterruptible kernel tasks in run(), then REPL can switch between live and freeze states by toggling the interruptible flag of its own task. You can also suspend other CPUs using a boolean variable to force their next() to all return the nil task which makes the CPU halt. You don't need to disable their interrupts. To resume, just reset the variable.

Even so, such REPL won't allow you to step into or over the scheduler or interrupt handlers. A true SoftICE styled debugger is only possible if you hijack the CPU specific implementation of run(). The good news is that it should be able to coexist with our REPL which is already pretty useful.