Previously, the kernel's network stack also acted as a dispatcher, so that multiple programs could share one network interface concurrently and each receive only the traffic destined for it. On receiving a packet, the network card puts it into the next available RX buffer and hands it over to the kernel; the kernel parses the packet headers to determine how to route the data, then copies the data out. Bypassing the kernel therefore bypasses not only the data plane but also the control plane.
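To make the traditional dispatch step concrete, here is a minimal sketch of it, assuming IPv4 and UDP over Ethernet. The names `sockets`, `bind`, `kernel_receive`, and `build_udp_frame` are hypothetical and do not correspond to any real kernel API; the point is only that the lookup key comes from layer 3 and layer 4 headers, and that delivery ends in a copy.

```python
import struct

ETH_HEADER_LEN = 14
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17

# Hypothetical socket table: (destination IP, destination port) -> receive queue.
sockets = {}

def bind(ip: bytes, port: int) -> list:
    """Register a receive queue for one (IP, port) pair."""
    queue = []
    sockets[(ip, port)] = queue
    return queue

def kernel_receive(frame: bytes) -> bool:
    """Parse the headers, find the owning socket, and copy the payload out."""
    if len(frame) < ETH_HEADER_LEN + 20:
        return False
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETHERTYPE_IPV4:
        return False
    ihl = (frame[ETH_HEADER_LEN] & 0x0F) * 4          # IPv4 header length in bytes
    if frame[ETH_HEADER_LEN + 9] != IPPROTO_UDP:
        return False
    dst_ip = bytes(frame[ETH_HEADER_LEN + 16:ETH_HEADER_LEN + 20])
    udp_off = ETH_HEADER_LEN + ihl
    if len(frame) < udp_off + 8:
        return False
    _src_port, dst_port, udp_len, _csum = struct.unpack_from("!HHHH", frame, udp_off)
    queue = sockets.get((dst_ip, dst_port))
    if queue is None:
        return False                                   # no program listening
    # This slice is the copy that a zero-copy design wants to eliminate.
    queue.append(bytes(frame[udp_off + 8:udp_off + udp_len]))
    return True

def build_udp_frame(dst_ip: bytes, dst_port: int, payload: bytes) -> bytes:
    """Build a test frame; MACs, source IP, and source port are arbitrary."""
    eth = b"\x02\x00\x00\x00\x00\x02" + b"\x02\x00\x00\x00\x00\x01"
    eth += struct.pack("!H", ETHERTYPE_IPV4)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28 + len(payload), 0, 0,
                     64, IPPROTO_UDP, 0, bytes([10, 0, 0, 1]), dst_ip)
    udp = struct.pack("!HHHH", 12345, dst_port, 8 + len(payload), 0)
    return eth + ip + udp + payload
```

For example, after `q = bind(bytes([10, 0, 0, 2]), 8080)`, passing `build_udp_frame(bytes([10, 0, 0, 2]), 8080, b"hello")` to `kernel_receive` appends the payload to `q`, while a frame for an unbound port is rejected.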
One solution is to let the kernel retain control of all RX buffers and handle the receive events. To dispatch a packet, the kernel would modify the application's address space to memory-map the RX buffer holding that packet. The problem with this approach is that the first access to the mapping causes a soft page fault costing thousands of cycles, and unmapping causes a costly TLB shootdown. Rather than wasting that many cycles on the mapping, we might as well copy the data, which can be offloaded to a DMA controller.
If we want to avoid both the copy and the cost of mapping, then we need to map the RX buffers into the application's address space permanently and somehow design the network interface to dispatch incoming packets to the right buffers. An application is identified by its IP address (layer 3) and port number (layer 4). The kernel traditionally handles the layer 3 and layer 4 semantics, but modifying the network card (layer 2) to understand higher layers would violate the abstraction imposed by the OSI model.
To keep the network card aware of layer 2 only, each application will be assigned its own MAC address. The card will direct a packet only to the RX buffers assigned to the packet's destination MAC address.
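A minimal sketch of this layer-2 dispatch, written as a software model of the card. The `Nic` class, `RING_SLOTS`, and the ring layout are assumptions made for illustration, not a description of real hardware. Frames that match no application ring, or whose ring is full, fall through to a kernel-owned ring, in line with the kernel's role described later in this proposal.

```python
from collections import deque

RING_SLOTS = 4  # hypothetical capacity of each application's RX ring

class Nic:
    """Software model of a card that dispatches on destination MAC only."""

    def __init__(self):
        self.kernel_ring = deque()   # fallback ring owned by the kernel
        self.app_rings = {}          # destination MAC (6 bytes) -> RX ring

    def assign_ring(self, mac: bytes) -> deque:
        """Give an application-specific MAC address its own RX ring."""
        ring = deque()
        self.app_rings[mac] = ring
        return ring

    def receive(self, frame: bytes) -> str:
        """Place a frame by looking only at the first 6 bytes (destination MAC)."""
        dst_mac = frame[:6]
        ring = self.app_rings.get(dst_mac)
        if ring is not None and len(ring) < RING_SLOTS:
            ring.append(frame)       # lands directly in the application's buffers
            return "app"
        self.kernel_ring.append(frame)  # unknown MAC or ring overflow
        return "kernel"
```

Note that `receive` never inspects IP addresses or ports, which is the property the design is after: the card stays a pure layer-2 device.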
Traditionally, a MAC address can have one or more IP addresses, but each IP address maps to exactly one MAC address. This means that each application would require its own IP address. For IPv6 this might not be a big problem, but for IPv4 it is salt in the wound of address exhaustion, which has already happened.
The Address Resolution Protocol (ARP) is used by a network gateway to translate a destination IP address to a destination MAC address as a packet enters a layer-2 Ethernet LAN from a layer-3 WAN domain. If we were to reuse one IP address across multiple MAC addresses, we would need to modify ARP to resolve an IP-port pair instead of the IP address alone.
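The modified lookup might be sketched as follows. This models only the resolver's table, not the ARP wire format, and `PortAwareArpTable` and its methods are hypothetical names. The fallback to a host-wide entry when no application owns the port is an assumption of this sketch, chosen so that such traffic still reaches the host.

```python
from typing import Dict, Optional, Tuple

class PortAwareArpTable:
    """Resolver cache keyed by (IP, port) instead of IP alone."""

    def __init__(self) -> None:
        self._by_ip_port: Dict[Tuple[str, int], str] = {}
        self._by_ip: Dict[str, str] = {}   # host-wide fallback entries

    def learn_host(self, ip: str, mac: str) -> None:
        """Record the classic IP -> MAC mapping for a host."""
        self._by_ip[ip] = mac

    def learn_app(self, ip: str, port: int, mac: str) -> None:
        """Record an application-specific MAC for one (IP, port) pair."""
        self._by_ip_port[(ip, port)] = mac

    def resolve(self, ip: str, port: int) -> Optional[str]:
        """Prefer the application-specific entry, else the host-wide one."""
        mac = self._by_ip_port.get((ip, port))
        if mac is not None:
            return mac
        return self._by_ip.get(ip)
```

With a host entry for `10.0.0.2` and an application entry for `("10.0.0.2", 8080)`, resolving port 8080 returns the application's MAC, while any other port on that address resolves to the host's MAC.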
Under this architecture, the application-specific MAC addresses are assigned randomly, with collision detection over broadcast. All programs on the same host will still share the same layer-3 address but be differentiated by port number. The network gateway effectively pre-sorts packets into the network interface's RX buffers. The kernel retains the canonically assigned hardware MAC address for backwards compatibility. The kernel will still have its own RX buffers: they handle low-bandwidth network control traffic (ARP, ICMP, DHCP, etc.), packets destined for non-existent programs, and overflow when an application cannot release its own RX buffers fast enough to receive more packets.
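Random assignment with collision detection could look roughly like this. The broadcast probe is abstracted as a caller-supplied `probe_in_use` function, which is an assumption of this sketch; how the probe is carried on the wire is left open. The generated addresses set the locally administered bit and clear the multicast bit of the first octet, so they cannot clash with vendor-assigned hardware addresses.

```python
import os
from typing import Callable

def random_local_mac(rand: Callable[[int], bytes] = os.urandom) -> bytes:
    """Return a random unicast, locally administered MAC address."""
    raw = bytearray(rand(6))
    raw[0] = (raw[0] | 0x02) & 0xFE   # set local bit, clear multicast bit
    return bytes(raw)

def assign_app_mac(probe_in_use: Callable[[bytes], bool],
                   max_attempts: int = 16,
                   rand: Callable[[int], bytes] = os.urandom) -> bytes:
    """Draw random MACs until the probe reports one that is free on the LAN.

    probe_in_use is a hypothetical stand-in for a broadcast query such as
    "is anyone already using this address?".
    """
    for _ in range(max_attempts):
        candidate = random_local_mac(rand)
        if not probe_in_use(candidate):
            return candidate
    raise RuntimeError("no free MAC address found")
```

With 46 random bits per address, a collision on a single LAN should be rare, so the retry loop is a safety net rather than the common path.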
In summary, here are the architectural changes I'm proposing:
- Modify ARP to resolve an IP-port pair to an application-specific MAC address.
- Each application on a host will have its own assigned RX buffers, identified by an application-specific MAC address that is assigned randomly, with collision detection so that it does not collide on the LAN.
- The RX buffers are permanently mapped into the application's address space.
- The kernel still handles the control plane, responding to ARP requests that map an IP-port pair to an application-specific MAC address.
This will realize an efficient zero-copy C10M solution with minimal modification to the existing network architecture.