Friday, November 29, 2013

Aligned data transfer, or Why Headers Must Come Last

When a program on computer A wants to send data to a program on computer B, the data undergoes a series of nested transformation where, at each stage, a header is prepended to the data. Prepending is costly because data must be copied to a new location in order to make space when inserting, or to eliminate space when removing headers, assuming the alignment of data in the destination is important. Alignment is important because of the way memory system is divided into pages, and the way permanent storage is divided into sectors.

The current protocol stack design relies on prepending so much that any implementation inevitably performs a lot of data copying. Copying data is detrimental to performance because of wasted CPU cycles and memory bandwidth. If headers come after the payload, we can append and remove off headers easily with simple memory management, all without affecting payload alignment.

Header prepending is generally not a problem for the sender because network cards support gathering (i.e. vectored I/O) which incurs no cost for packet assembly, but the receiver cannot in general predict the header size. We can program the network card to scatter the payload to a well-aligned memory address only if we know the header size in advance.

Consider sending data as a typical TCP/IP packet. The data undergoes the following transforms:

  • Optional application-level encryption and integrity checking. This adds frame header to tell apart negotiation from data transmission, and also a cipher header. See TLS/SSL.
  • TCP header is prepended by the operating system to encapsulate the data within a logical stream socket, for establishment of connection, in-order delivery, and flow control.
  • IP header is prepended by the operating system for routing information, i.e. source and destination of the packet. Both IPv4 and IPv6 are in active use.
  • Other host-level encapsulation, e.g. VPN, PPPoE.
  • Depending on the link layer, either the network card or the OS would prepend a link layer header. This allows the communication hardware to identify itself over the medium.
  • Other encapsulation and decapsulation headers (e.g. MPLS, DOCSIS) are not visible to the operating system.
Predicting header size is hard primarily because of the dual-stack and host-level encapsulation. One can argue that because host-level encapsulation typically transmits data over slow and high-latency WAN links, performance is less of a concern. But think about the Internet backbone routers that might also need to encapsulate. Any saving in their CPU cycle and memory bandwidth is an increase in overall Internet performance.

If we append instead of prepend headers (of course header would be a misnomer now), headers can be removed by truncation. We still need gathering so that the operating system does not need to rely on the application to allocate sufficient buffer size to accommodate for all headers. Now the scattering only needs to focus on receiving a jumbo packet across multiple discontinuous memory pages.

Network technology can continue to scale, but with the current header prepending protocol design, this requires the CPU and memory technology to scale 2x or more than the network. This is unrealistic, so we must redesign the network protocols to append headers instead.

No comments: