Life of a Computer Scientist: networking

Showing posts with label networking. Show all posts

Friday, January 23, 2026

Introduction to Fiber Optics for 10G Network at Home

Fiber optics are becoming cheaper and more affordable for homes nowadays, and they could be a fantastic alternative for RJ-45 10GBase-T Ethernet. Fiber transceivers at the same speed use less power and run cooler. They also don't need to be upgraded like you would need to upgrade an Ethernet cable from Cat5e to Cat6a to now Cat8.

I have worked with fiber network designs at work, at least in theory. But in the last two weeks, I learned even more than I cared for about fiber optics, and here is a quick summary about them before I forget.

Fiber core size: single-mode fiber (SMF, OS2) vs. multimode fiber (MMF, OM5, OM4, OM3, OM2, OM1).

Single mode has a smaller core and can run for longer distances. Multimode fibers may be more cost effective once upon a time, but these days single mode fibers are so cheap, they should be the default choice for even short (< 100m) runs.
Note: the mode type must match with the transceiver as well, so you have to use a single mode fiber with a single mode transceiver.

Number of strands: simplex (1 fiber) or duplex (2 fibers).

The transceivers most commonly used for data center applications will use duplex fibers, using one strand for each direction of transmission. Both directions typically use the same wavelength.
Simplex fibers are more common for Fiber To The Home (FTTH). You have to use a matching (opposite) pair of transceivers at the two ends, e.g. one side is 1330nm-TX/1270nm-RX, and the other side should be 1270nm-TX/1330nm-RX. The two wavelengths traverse on the same fiber strand.

Connector type: SC (standard connector, larger) vs. LC (lucent connector, smaller).

For some reason, the LC transceivers are more commonplace, possibly because you can fit a duplex connector into a transceiver with the SFP form factor.
There are also SC transceivers, but these are less common.
Again, you have to match the connector with the transceiver.

Connector contact type: UPC (domed, color coded blue) vs. APC (angled, color coded green). They refer to how the fiber ending is terminated. The angled type reduces back-reflection.

You have to match the contact type with the transceiver as well. Inserting APC connector to a UPC transceiver can damage the optics.
I typically find either SC/APC or LC/UPC to be more common. The other combinations (e.g. SC/UPC or LC/APC) are possible but uncommon.
Adapters exist but they can cause signal loss. Due to the naming, it can be hard to find the correct adapter by keyword (e.g. "SC/APC female to LC/UPC male" can also give you results for "SC/APC male to LC/UPC female").
Sometimes SC/APC is written as SC/Angled or simply SCA.

Bend radius: more recently, I learned about bend insensitive fibers from a video about the InvisiLight fiber kit. The whole kit cost $250 and includes a pre-terminated G.657.B3 simplex fiber, two 1G transceivers, and two media converters. For my purpose, I would need to buy my own 10G transceivers and a 10G media converter.

Bend insensitive fibers are characterized by having smaller bend radius than the regular fibers (30mm), and they are defined by the standards G.657.A1 (10mm), G.657.A2 or B2 (7.5mm), and G.657.B3 (5mm). This primarily allows installation in tight corners in the home, and the transceivers are agnostic to the bend radius.

FS.com has a custom fiber builder where you can build simplex or duplex fibers with custom core sizes, connector and contact types, and bend radius. They offer white color which is better than bright yellow for home installation. However, their sales representative told me that G.657.A2 with 0.9mm cable diameter has a minimum order quantity (MOQ) of 1KM (this is not on the website). The best they could do with no MOQ is G.657.A1 with a 2.0mm diameter. Perhaps if more people inquire about the G.657.A2 0.9mm, they would eventually lift the MOQ.

There appears to be some pre-terminated G.657.A2 0.9mm cables on eBay, but I have not tested them. Some of the fibers on eBay are also a bit weird, having SC/APC (green) on one side and SC/UPC (blue) on the other, or they are from SC to LC. Either way, a 40 meter (130 feet) long bend insensitive fiber cable should be in the $12-$16 range. Fiber is cheaper than Cat6 per unit length, and they are not limited to 100 meters.

For running between rooms in the home, you would typically use a simplex single-mode and matching LC/LC or SC/SC transceivers, and choose or build a fiber cable with the same connector. Here are some transceiver options I found:

LC/UPC single-mode simplex:

Elfcam branded 10G LC Bidi transceivers (20km), sold as a matching pair for $55.
FS.com branded 10GBASE-BX BiDi transceivers (options from 10km to 100km). Here you have to buy two separate transceivers with matching opposite TX/RX frequencies. A pair of 10km would cost $100.
Ubiquiti branded 10G LC Bidi transceivers (6.2 miles, or 10km), sold as a matching pair for $160.

SC/APC single-mode simplex: Elfcam branded 10G SC Bidi transceivers (20km), sold as a matching pair for $66.

A distance of 10km / 6.2 miles will let you run fiber to your friend's house living in the next town!

For running short distances between devices in the rack room, LC duplex single-mode has more transceiver options. I also use a Direct Attached Cable, which is just copper but with both ends hard-wired to SFP+ connector.

The SFP+ connector refers to the plug form factor where the transceiver is inserted into the switch device. It determines the electrical speed as well.

SFP: 1G.
SFP+: 10G, 25G.
QSFP (or QSFP28, literally "quad SFP"): 100G.

There are four electrical lanes.
When a "breakout transceiver" is plugged in (e.g. 4x25G-LR), a single port can be used as four independent interfaces, e.g. 4x25G.
A "singleton transceiver" (e.g. 100G-LR4) would combine all 4 electrical lanes to drive one SerDes (serializer/deserializer).

QSFP-DD (or QSFP56, or QDD, literally "double density quad SFP"): 400G, 800G.

There are eight electrical lanes that support various breakout transceivers: 4x100G, 8x50G, or 8x100G.
They can also be used together, e.g. 400G-FR4.
Sometimes the breakout transceivers are confusingly named as 400G-DR4 (breaks out to 4x100G-DR) or 400G-XDR4 (breaks out to 4x100G-FR).

OSFP (literally "octa SFP"): 400G, 800G.

There are eight electrical lanes also supporting various breakout transceivers.

The main reason for these "breakout transceivers" is to allow switches to interoperate with previous generation switches, e.g. a switch with QSFP ports would use a breakout transceiver to split each port into four fibers connecting to SFP+ transceivers, or QSFP-DD/OSFP breakout out to QSFP. You don't typically see this on consumer devices, but some of them support breakout transceivers, e.g. QNAP QSW-M7308R-4X (I bought it at $999 a week ago, now it's $1299, probably due to the new shipment having Trump tariffs).

With data centers upgrading from QSFP 100G to QSFP-DD or OSFP 400G or 800G to satisfy the AI computing needs, and with more FTTH deployments, we are seeing more surplus SFP+ 10G transceivers and switches flowing to the consumer market. This could be a good time to upgrade your home network cabling from copper based cable to fiber.

Saturday, February 8, 2025

Polling vs. Interrupt

Recently there is news reverberating across the interwebs that Tiny Linux kernel tweak could cut datacenter power use by 30%. The claim is based on this code commit detailed by the paper Kernel vs. User-Level Networking: Don't Throw Out the Stack with the Interrupts. The main idea is to mix the use of polling and interrupt driven I/O to get the best of both worlds.

Polling is just busy looping to wait for something to happen. It only takes a few CPU cycles each time we check, so the turnaround is very efficient. But if we keep looping and nothing happens, it is wasteful work.

Interrupt driven I/O programs the CPU how to handle an event when something happens, then it tells the CPU to go do something else or go to sleep when there is nothing else to do. The CPU has to save the context of the current program and switch context between programs, so the overhead is greater. But it is more efficient if there is a lot of wait between events.

When we have events that take some time between arrival, but tend to arrive in a bunch (e.g. poisson process with Exponential Distribution intervals), then it makes sense to wait for the next event using an interrupt, then process subsequent events using polling. We can further limit the amount of time spent in polling to equal to the opportunity cost if we were to use an interrupt, and it's a win-win situation: if an event arrives sooner, we could save on the cost of the interrupt minus the cost of polling, but never worse.

This idea is in fact not new. This technique is often used in task schedulers. I remember seeing its use in an old MIT Cilk version circa 1998. I have seen it in some mutex implementations. I have also used the same technique in my Ph.D. work circa 2014 (in the artifact Liu_bu_0017E_429/libikai++-dissertation.tbz2, look for parallel/parallel.cc and the deep_yield() function in particular). This technique just seems very trivial, so I have never seen anyone bother mentioning it.

So will this technique cut datacenter power use by 30%? The paper measured "queries per second" improvements in a memcached microbenchmark, but it is naive to assume that 30% increase in performance automatically translates to 30% reduction in power use. That's assuming all the datacenter does is receiving network packets with no additional work, such as: cryptography, marshaling and unmarshaling of wire protocol, local memory and data storage I/O, calling out to external services, and computing a response. The 30% figure is the upper bound in the most ideal scenario, and it caught the attention of the Internet.

It is a welcoming optimization nonetheless, and the authors have done the hard work to validate their claims with benchmarks and a thorough explanation how it works, so they deserve the credit for the execution. Perhaps someone could continue to audit whether other OS subsystems (e.g. block devices, CPU scheduler) can also benefit from this treatment, for Linux and other OSes.

Tuesday, October 5, 2021

Four Lessons Learned from Facebook BGP Outage

First of all, a disclaimer: I don’t work at Facebook, but enough is known about the outage that I think we can all learn a few lessons from it. Be it a cautionary tale on network design.

On Monday, Oct 4, 2021, Facebook literally unfriended everyone on the Internet. That’s because they’ve broken the BGP peering sessions and withdrawn all BGP routes to their network, so their network became unreachable. CloudFlare has explained how BGP worked in great detail, so I don’t need to repeat it here.

The most immediate effect is that Facebook’s name servers became unreachable, which is what most people immediately notice because name servers are the first thing a web browser would try to contact when you visit a website. At the time of writing, Facebook advertises 4 name servers.

$ host -t ns facebook.com
facebook.com name server d.ns.facebook.com.
facebook.com name server a.ns.facebook.com.
facebook.com name server b.ns.facebook.com.
facebook.com name server c.ns.facebook.com.
$ host a.ns.facebook.com
a.ns.facebook.com has address 129.134.30.12
a.ns.facebook.com has IPv6 address 2a03:2880:f0fc:c:face:b00c:0:35
$ host b.ns.facebook.com
b.ns.facebook.com has address 129.134.31.12
b.ns.facebook.com has IPv6 address 2a03:2880:f0fd:c:face:b00c:0:35
$ host c.ns.facebook.com
c.ns.facebook.com has address 185.89.218.12
c.ns.facebook.com has IPv6 address 2a03:2880:f1fc:c:face:b00c:0:35
$ host d.ns.facebook.com
d.ns.facebook.com has address 185.89.219.12
d.ns.facebook.com has IPv6 address 2a03:2880:f1fd:c:face:b00c:0:35

To be fair, these four IP addresses are probably anycast addresses backed by several distributed clusters of physical servers. Unfortunately, all of them are in the same Autonomous System, AS 32934. Even though their networks are distributed, the AS identifies the scope of BGP peering, so that becomes the single point of failure when BGP is misconfigured.

Lesson 1: If you have multiple name servers, they should be distributed across different Autonomous Systems if feasible. Most people can’t do it, but an organization at the scale of Facebook should get multiple AS numbers.

As a side note, it is common for a big company like Facebook to have just one AS spanning multiple continents. I don’t think that’s a good idea. If a user in Europe wants to visit your server located in Asia, their traffic will enter your peering point in Europe first, which means you have to do internal routing to Asia yourself. That’s great if you build your own inter-continental backbone and want complete control over end user latency, but not so great if you experience traffic surge and want to shed traffic to the other backbone providers. It’s better to have one AS for each continent for locality reason, and have an extra AS or two for global anycast (e.g. for CDN). This gives you greater flexibility in inter-continental traffic engineering. I should probably write another blog post about this some other time.

I also found out that Facebook’s BGP peering is almost fully automated. At the company where I work, we received a notice from Facebook exactly two weeks ago. It says that an idle session has been detected, and “if your session does not establish within 2 weeks, it will be removed.” So two weeks after the notice would coincide with this Facebook outage.

Having looked at this in more detail, I’m skeptical that this message alone could explain the outage. I looked up the IPv6 addresses used for peering, and found that the prefix 2001:7f8:64:225::/64 belonged to Telekom Indonesia (AS 7713), so I believe this address has been allocated for a datacenter in Indonesia that was only used for peering in that region. If so, this message would have only explained regional outage but not worldwide outage.

But if this is any indication, notice that the message has only been viewed 4 times in a whole company of thousands of people doing network operations. It is often the case that many alerts like this have been fired and then ignored.

Lesson 2: Do not ignore automated alerts for a production system.

One thing for certain is that Facebook definitely has an automated system to clean up idle peering sessions. I recently encountered a similar issue with a project at my work where some new code was cleaning up idle sessions too aggressively and caused connectivity issues. It was discovered in a lab so all was well.

The official reason of the outage, according to Facebook,

During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.

In Facebook’s case, their software might have erroneously detected all BGP sessions as idle and then promptly deleted them. This could be prevented if someone would look at the effective changes (“diffs”) to be caused by the command, not just auditing the command itself, and this would allow them to discover problems before executing the command in production. In a production system, it pays to be more careful.

Lesson 3: The effect of production changes should be manually reviewed and signed off before executing.

Last but not the least, BGP was not the only way to define network routes. BGP was designed to find the least costly path over multiple hops of the network, but before BGP, network engineers simply configured static routes. The routing table for the entire Internet is now too large (~900K entries at the time of writing, see CIDR Report for the current size) to be manually configured. Facebook alone advertises around 160 IPv4 prefixes, but they only need to add a few well-known minimum routes as backup to keep basic connectivity. Physical network changes take significantly more effort, so updating the static routing table is just a small overhead at Facebook’s scale.

Lesson 4: Always have a backup minimal configuration when the configuration is decided by an automated algorithm.

That’s all, folks!

Some wisecrack post-scriptum: I’m a programming languages person by training, but somehow over the years I’ve amassed enough networking knowledge to write convincingly about it.

I may have spent more time making the cover graphics than writing, to be honest. Give me a thumbs up if you like the graphic, even if you couldn’t care less about the content.

The opinion expressed does not reflect that of my employer, not that I’ve made it easy to tell who that is either (hopefully).

Friday, January 19, 2018

Embedding a Binary Search Tree in an Array

Recently I started researching what would be the most efficient data structure to represent an IP address lookup table. An IP address table is useful for making routing decisions or access control, which checks whether a single IP address is a member of a range of netblocks. A netblock is represented by a base address and a prefix length, e.g. 192.168.42.0/24 represents a range of IP addresses from 192.168.42.0 through 192.168.42.255. Hash table is not suitable for this purpose because hashes are computed on exact matches. With the given example, this would require entering all 256 addresses in the range into the hash table, which is a significant blowup.

Hardware routers use Ternary Content Addressable Memory which can tell whether a specific value matches anything in the memory, after applying a bitmask. A software implementation might use a prefix tree (trie), which theoretically deduplicates common prefixes among the entries, resulting in a smaller table. However, any data structure employing pointers has the added overhead of the pointers (a machine word) and cache locality problems, which are not considered in theoretical papers.

The most compact data structure for the vast majority of lookup tables is going to be just an array of IP netblocks. For an unsorted array, lookup takes $ O(n) $ time which will be slow for a big table. If we presort the array and use binary search, the lookup takes $ O(\lg n) $ time. But binary search is a pathological case for cache locality because lookups tend to jump back and forth until it settles in to a target. Rather than a binary search which partitions candidates by two parts, it was shown that ternary search (three partition) improves performance, while quaternary search (four partition) improves little over binary search.

It had occurred to me that cache locality for binary search could be improved if we had simply rearranged the presorted array to become an embedding of binary search tree (BST), such that for node at index $ i $, its left and right children are at $ 2i + 1 $ and $ 2i + 2 $ respectively. The scheme is inspired by how binary heap is embedded in an array, but we are embedding a complete BST. The rationale is that the search always occurs left to right in the array, with the most accessed items located at the beginning of the array, which gets the most cache hits.

Here is an example showing how a presorted array with the number sequence $ 0 \dots 9 $ could be embedded.

Note that this embedding works for any presorted array. The values for $i$ are the indices of the BST embedding, and the values for $x$ are the indices in the presorted array. A presorted array can always be represented by a complete BST, so the array is dense. However, we can't embed a prefix tree this way because a prefix tree is sparse.

The implementation strategy is to presort the array, create a new array, and copy the items from the presorted array to the new array in the order of the BST embedding. The preprocessing only needs to be done once, and it results in the most compact representation of the lookup table. The assumption is that once the table is preprocessed, it won't need to be changed again.

I implemented the embedded BST in Go and compared its lookup performance with the builtin sort.Search() which performs binary search on a presorted array. The runs are repeated for increasing array sizes in the powers of 2 as well as stride width (both in terms of the number of machine words, not in bytes). The items to be searched are incremented by the stride width for each iteration; small stride approximates sequential lookup, whereas large stride approximates random lookup. Each of the data point is the average number after running some 2,000,000 to 100,000,000 iterations (the number of iterations varies due to the idiosyncrasies of the Go benchmark framework, not my fault; it tries to run each case for about 5 seconds).

The folly of this is that sequential lookup (smaller stride widths) may not cover a significant portion of a very large array size $2^n$ where $n \ge 20$, but the experiment is fair because (1) both algorithms are subject under the same condition, and (2) the array size above $2^{20}$ already exceeds L2 cache size on the machine running the experiment, which is sufficient to demonstrate performance difference due to cache locality. The performance numbers are collected on a late 2013 model of MacBook Pro 13" whose Core i5 "Haswell" I5-4258U CPU runs at 2.4 GHz and has 32KB L1 cache and 3MB L2 cache. This machine has 8GB DDR3 ram. The program is compiled in 64-bits.

[popup figure in a new window]

You can use the drop-down menus to highlight the series for a specific algorithm (binary search, embedded BST) to see how they perform at various stride widths, or to highlight specific stride widths to compare how the two algorithms perform.

Here are some observations.

At small array sizes $2^n$ where $n \le 7$, all lookups fit in the L1 cache. The two algorithms have identical performance at all stride widths.
From array sizes $2^7$ to $2^{12}$, the L1 misses start to happen. Array size $2^{12}$ occupies 32KiB, the size of L1 cache.
Array sizes $2^{12}$ through $2^{18}$ still fit within the L2 cache, after which the performance starts skyrocketing at the largest stride width (which simulates random access). Even so, embedded BST held up much better than binary search.
At small strides, embedded BST has a more predictable performance than binary search across all array sizes.
Embedded BST performs at least as well as or better than binary search across all array sizes and strides.

Admittedly, embedded BST is optimized for lookup. After the initial preprocessing, we don't try to insert or remove entries from the table. If a modifiable table is desired, we can relax the density of the array so that the BST only needs to be approximately balanced. This allows us to use any classical balanced BST algorithms like red-black tree or AVL tree. It may be worthwhile to investigate the cache performance of array embedded versions of these algorithms as future work.

Some IP address lookup use cases require inserting and deleting individual addresses, such as for the purpose of intrusion detection. Since they only need to match individual addresses rather than netblocks, a hash table would suffice.

Otherwise, embedded BST is a great data structure for immutable IP address range lookup tables that is initialized once and optimized for lookup. It has the most compact representation and has much improved cache locality over binary search.

Friday, January 1, 2016

Address Resolution Protocol for Application Specific Buffers

As part of the c10m problem, which advocates separating data plane from the control plane that routes the data, applications (such as ScyllaDB) have been built to bypass the kernel for data plane, opting to receive and transmit network packets directly from memory mapped buffers of a network card using a framework such as netmap or DPDK. But applications built this way hinges on exclusive control to the network interface. Running additional applications requires additional interfaces.

Previously, the kernel's network stack also acted as a dispatcher so that multiple programs can share one networking interface concurrently and only receive traffic destined for the specific program. On receiving the packet, the network card puts the packet into the next available RX buffer, hands it over to the kernel, and the kernel copies the data out after parsing the packet headers to determine how to route the data. Bypassing the kernel not only bypassed the data plane but also the control plane.

One solution is to let the kernel retain control of all RX buffers and handle the receive events. To dispatch the packet, the kernel would modify the application's address space to memory map the RX buffer for that packet. The problem with this approach is that first access of the mapping causes a soft page fault worth thousands of cycles, and unmapping causes costly TLB shoot-down. Rather than wasting that many cycles on the mapping, we might as well copy the data which can be done through a DMA controller.

If we want to avoid copying and the cost of mapping, then we need to perennially map RX buffers to the address space and somehow design the network interface to dispatch receiving packets to the right buffers. An application is identified by the IP address (layer 3) and port number (layer 4). The kernel traditionally handles the layer 3 and layer 4 semantics, but modifying the network card (layer 2) to understand higher layer is a violation of the abstraction imposed by the OSI model.

To keep a network card only aware of layer 2, each application will be assigned its own MAC address. The packets will only be directed to the RX buffers assigned to the packet's destination MAC address.

Traditionally, a MAC address can have one or more unique IP addresses, but one IP address uniquely maps to one MAC address. This means that each application will require its own IP address. For IPv6 this might not be a big problem, but for IPv4 this is salt on a wound caused by exhaustion of IP addresses which had already happened.

Address Resolution Protocol is used by a network gateway to translate destination IP address to destination MAC address as a packet enters a layer-2 Ethernet LAN from a layer-3 WAN domain. If we were to reuse an IP address across multiple MAC addresses, we need to modify ARP to resolve IP-port instead of just IP alone.

Under this architecture, the application-specific MAC addresses are assigned randomly with collision detection over broadcast. All programs on the same host will still share the same layer-3 address by differentiated by port number. The network gateway effectively handles the predisposition of packets into network interface RX buffers. The kernel retains the canonically assigned hardware MAC address for backwards compatibility. The kernel will still have its own RX buffers: they handle low-bandwidth network control traffic (ARP, ICMP, DHCP, etc.), packets destined to non-existent programs, as well as overflows when an application could not release its own RX buffers fast enough to receive more packets.

In summary, here are the architectural changes I'm proposing:

Modify ARP to resolve IP-port to application specific MAC addresses.
Each application on a host will have their own assigned RX buffers, identified by the application specific MAC address which is randomly assigned to not collide on the LAN.
The RX buffers are perennially mapped into the application's address space.
The kernel still handles the control plane, responding to ARP requests that maps IP-port to application specific MAC address.

This will realize an efficient zero-copy C10M solution with minimal modification to existing network architecture.

Sunday, February 15, 2015

Extended Datagram Protocol

I need a datagram protocol as a basis for transporting RPC over a network. I'd like to call it Extended Datagram Protocol or XDP.

Here are the requirements:

Message size will be larger than MTU up to ~128KB. Most networks support Ethernet MTU of 1500 bytes, which means large packets need to be fragmented. IP fragmentation handles packet loss poorly. If a single fragment is lost, the whole packet will be dropped.
The fragments are sent over UDP. XDP will handle packet assembly with forward error correction. Each fragment will have at most 1280 bytes of payload. It is a nice even number and leaves handsome room for the headers.
The RPC layer will allow messages to arrive out of order, so the datagram protocol does not need to worry about packet ordering.
The datagram protocol does not need to implement message acknowledgement. It may be handled by the RPC layer, or even the application layer.
Encryption will be done at the RPC layer.
Session management will be done at the RPC layer.

I don't have a concrete design yet, but I think it is worth mentioning the rationale behind these requirements. There are a number of proposals that extend UDP such as Reliable User Datagram Protocol and Stream Control Transmission Protocol that don't fit the bill. In-order delivery of packets actually becomes a head-of-line blocker, such is what QUIC tries to avoid.

QUIC (Quick UDP Internet Connections) features packet level error correction, but it is trying to do a lot more such as flow control and encryption. I think flow control should be handled by a layer above (e.g. RPC layer) which can do flow control not only based on network condition but also server load. I don't mind that it handles encryption. However, crypto negotiation packets can easily be larger than the MTU, so I think it would be easier to implement encryption with proper large message support provided by a simpler datagram protocol. I'm not trying to argue that QUIC should be designed differently, only that I think my requirements are different and not well-addressed.

An important design philosophy I have is that we need to share protocol implementation as much as possible, perhaps even across layers. TLS basically invented its own RPC protocol, and it uses X.509 which uses the ASN.1 message serialization scheme that differs from TLS. If I were to layer my own RPC on top of TLS there would be three protocol implementations.

The reason Extended Datagram Protocol does not handle anything besides message fragmentation is that everything else could really be handled by the RPC layer that has a strongly-typed and unified message serialization/deserialization scheme. The only difficulty that message serialization has trouble dealing with is the framing. If we were to send a message over a stream based protocol like TCP, we would first send the message length, then the message body. The sender could do this in one socket write, but the receiver cannot do this in a single read. The receiver would have to first read the message length, then the message itself. The receiver could speculatively read extra bytes in one socket read, but it would need a way to pass the excess data to the next receiver, often complicating the design of the protocol stack.

Wednesday, January 7, 2015

Load Balancing Methods and Trade-Offs

Load balancing is the act of distributing service requests from clients to multiple servers that can perform a service. There are many excellent articles about load balancing, particularly the mechanisms used for the web, but few understand the trade-offs. The mechanisms can be roughly classified as client-side or server-side load balancing.

In client-side load balancing, the client somehow obtains a list of possible server it may use, and it implements the smart to decide how to distribute its own requests among the list of servers it has. Round-robin DNS is such an example, where the client receives a list of servers over DNS query. Very little additional infrastructure is needed other than configuration changes.
In server-side load balancing, the client reaches a server-side load balancer that distributes requests to backend servers. For what the client cares, the load balancer is simply one gigantic server. Some examples are IP Virtual Server and HTTP Reverse Proxy. They require heavier infrastructure investment.

In production, a mix of client and server side load balancing are often used. It's not hard to see why, if we consider the impact of load balancing at the client and at the server. Here we consider just latency and throughput.

Client-side load balancing tends to suffer worse latency than server-side load balancing. The reason is because the client often doesn't have visibility on server availability or load. If a server is overloaded or is offline, then the client may have to wait for a timeout before it tries another server. Propagating server load information to the client does not improve latency because the propagation delay only adds to the overall service time. Server availability is also highly dynamic, so a client cannot reuse this information for an extended period of time. On the other hand, a server-side load balancer is much closer to the backend servers, which means it could obtain server availability much faster. Also, the balancer can amortize the cost of querying server availability over many requests. Server-side load balancing, therefore, incurs no latency penalty to the client.

Client-side load balancing tends to enjoy greater throughput than server-side load balancing. That's because the network path for each client-server path could potentially take different routes. Of course, if a client tries to contact many servers, it would saturate its own uplink first, but if a server's uplink is saturated, we can easily add more servers and more uplinks. On the other hand, a server-side load balancer becomes a unique destination for many clients, so its uplink can be easily saturated. If it is the case that requests are smaller than the responses (which is often the case but not always), then Direct Routing can alleviate some of the load-balancer throughput bottleneck by having the backends return the response directly to the client, but it doesn't scale as easily as client-side load balancing.

The solution often used in production is to deploy both client and server side load balancing. A cluster in server-side load balancing makes a destination highly available, which guarantees the best service time for a given client. But client-side load balancing can send a client to multiple such clusters for better throughput.

Friday, November 29, 2013

Aligned data transfer, or Why Headers Must Come Last

When a program on computer A wants to send data to a program on computer B, the data undergoes a series of nested transformation where, at each stage, a header is prepended to the data. Prepending is costly because data must be copied to a new location in order to make space when inserting, or to eliminate space when removing headers, assuming the alignment of data in the destination is important. Alignment is important because of the way memory system is divided into pages, and the way permanent storage is divided into sectors.

The current protocol stack design relies on prepending so much that any implementation inevitably performs a lot of data copying. Copying data is detrimental to performance because of wasted CPU cycles and memory bandwidth. If headers come after the payload, we can append and remove off headers easily with simple memory management, all without affecting payload alignment.

Header prepending is generally not a problem for the sender because network cards support gathering (i.e. vectored I/O) which incurs no cost for packet assembly, but the receiver cannot in general predict the header size. We can program the network card to scatter the payload to a well-aligned memory address only if we know the header size in advance.

Consider sending data as a typical TCP/IP packet. The data undergoes the following transforms:

Optional application-level encryption and integrity checking. This adds frame header to tell apart negotiation from data transmission, and also a cipher header. See TLS/SSL.
TCP header is prepended by the operating system to encapsulate the data within a logical stream socket, for establishment of connection, in-order delivery, and flow control.
IP header is prepended by the operating system for routing information, i.e. source and destination of the packet. Both IPv4 and IPv6 are in active use.
Other host-level encapsulation, e.g. VPN, PPPoE.
Depending on the link layer, either the network card or the OS would prepend a link layer header. This allows the communication hardware to identify itself over the medium.
Other encapsulation and decapsulation headers (e.g. MPLS, DOCSIS) are not visible to the operating system.

Predicting header size is hard primarily because of the dual-stack and host-level encapsulation. One can argue that because host-level encapsulation typically transmits data over slow and high-latency WAN links, performance is less of a concern. But think about the Internet backbone routers that might also need to encapsulate. Any saving in their CPU cycle and memory bandwidth is an increase in overall Internet performance.

If we append instead of prepend headers (of course header would be a misnomer now), headers can be removed by truncation. We still need gathering so that the operating system does not need to rely on the application to allocate sufficient buffer size to accommodate for all headers. Now the scattering only needs to focus on receiving a jumbo packet across multiple discontinuous memory pages.

Network technology can continue to scale, but with the current header prepending protocol design, this requires the CPU and memory technology to scale 2x or more than the network. This is unrealistic, so we must redesign the network protocols to append headers instead.

Wednesday, August 21, 2013

Essence of a secure socket protocol

This evening I went back and read RFC 5246 Transport Layer Security Protocol 1.2 just for the heck of it. It is pretty complex. Pages 7 through 14 define how to serialize messages. Pages 15 through 63 define the structure of different message types passed between server and client and their meaning. The remaining few pages define how client and server may use a shared secret to exchanging data.

A secure socket protocol really boils down to this. Alice and Bob wish to talk to each other, so they send each other their public keys. The public key is used to encrypt a session key for the recipient which is subsequently used to encrypt the actual message from sender. After receiving each other's public keys, they send the encrypted session key, and encrypted data would follow.

Phase	Alice	Bob
1	Sends public key to Bob.	Sends public key to Alice.
2	On receipt of Bob's public key, tells Bob about how subsequent messages will be encrypted (session key).	On receipt of Alice's public key, tells Alice about how subsequent messages will be encrypted (session key).
3	Starts sending ciphertext to Bob.	Starts sending ciphertext to Alice.

Since TLS is layered on top of TCP, phase one begins after the TCP 3-way handshake, and TLS handshake is at least 6 ways. In total this takes 9 one-way trips or 4.5 round trips.

Phase 1 can be used to advertise the capabilities of the owner of the public key, so that the peer can choose an appropriate encryption and/or compression method. TLS also defines a mechanism to resume the use of a previous session key. This can be folded into Phase 1 as an option. The additional information sent for Phase 1 and 2 must be signed by the sender's public key. Authenticity and integrity of phase 3 communication is done using HMAC with the session key.

Here are some more ideas if I were to design my secure socket protocol from scratch.

A lower level secure socket protocol can embed Phase one with the SYN and Phase 2 with the ACK. If Alice is the client and Bob is the server, Alice would send her public key in SYN, Bob would respond with his public key as well as session key for alice in SYN-ACK, and then Alice will send session key for Bob in ACK. Then both can start communicating in ciphertext.

Notice that Phase 3 does not have to wait for Phase 2. Sender can immediately start sending ciphertext as soon as the session key is sent. This allows us to achieve just 1 round-trip. However, potential for network packet loss means that the sender has to be ready to resend the session key. It is probably easier implementation wise for Phase 3 to wait.

Rather than sending the public key all the times, a variation is to send a key-digest instead. If the recipient does not know the sender's key, then an additional round trip is required to obtain the key from sender. For a 4096-bit public key, this can save 512 bytes for the SYN. This is not a lot given today's high speed network.

Either party may terminate the connection early if:

It does not trust the peer's public key.
It does not understand how subsequent messages could be decrypted.

To ensure ciphertext authenticity and integrity, append HMAC to plaintext to be encrypted together. This makes it harder for a third-party to verify if they decrypted the message successfully. The receiver will decrypt first and then verify the HMAC.

The new secure socket would require extending the BSD socket API as follows:

connect_ex() is like connect() but allows passing arbitrary data, limited by single packet size, as part of the SYN. Upon return, the caller also gets the SYN data from the peer.
accept_ex() is like accept() but allows retrieving SYN data from peer as well as send its own, again limited by single packet size. This function might also take a callback which decides whether to accept or reject the peer's SYN data before sending our own.

The application layer implements its own encryption and compression so it can decide when symmetric key exchange ends and ciphertext communication starts.

Monday, May 20, 2013

Exponential back-off algorithm for IP packet fragmentation

On reading about Path MTU discovery, I found this article Never Ending Story of IP Fragmentation. It explains the problem with bad IP fragmentation when there are multiple encapsulations. Long story short, network links can only carry packets of a certain size called "maximum transmission unit" or MTU. Large packets are split into fragments to fit to the MTU. But MTU is reduced when tunneling IP traffic, e.g. GRE (used by VPN like PPTP and IPsec) and PPPoE. If we naively split a packet by next-hop MTU, then we might get this situation:

Original packet has 1516 bytes to traverse over an Ethernet link with MTU of 1500. It is split into two packets, fragment 1 of 1500 bytes and fragment 2 of 36 bytes (the excess 16 bytes plus additional 20 bytes IP header).
Fragment 1 traverses through a PPPoE tunnel with MTU of 1492 bytes. It is split into two more packets, fragment 1.1 of 1492 bytes and fragment 1.2 of 36 bytes (8 bytes excess + 28 bytes overhead). However, fragment 2 traverses unaffected but now becomes 44 bytes.
Fragment 1.1 traverses through a IPsec tunnel with MTU of 1400 bytes. It is split into two more fragments, fragment 1.1.1 of 1400 bytes and fragment 1.1.2 of 152 bytes (the excess 92 bytes plus 60 bytes encapsulation overhead). The other fragment 1.2 of 36 bytes traverses unaffected but becomes 96 bytes after encapsulation; fragment 2 also traverses unaffected but also becomes 96 bytes.
Fragment 1.1.1 traverses through another tunnel...

If there are N tunnel encapsulations, for each packet we might get N fragments. This is a blow-up of N factor in the total number of packets sent. This incurs cost both in packet switching and assembly.

The industry standard practice is "Path MTU Discovery" or PMTUD. It sends a series of "do not fragment" packets down the pipe until it gets rejected, and the hop that rejects it advertises the MTU. The originator then reduces MTU and tries again. The sole purpose is that the originator of the packet can figure out how much data to send in order to fit the smallest MTU of all the hops until it reaches the destination, at the cost of large connection startup time. Each MTU discovery incurs a partial round-trip cost, so the total round trip time is $ O(n^2) $ where n is the number of hops. This is especially costly for servers that deal with a large number of clients. The server would have to probe and remember each client's path MTU.

What I'm proposing is to get rid of PMTUD altogether but apply the principle of exponential back-off in fragmentation.

Every time a packet has to be fragmented, divide the packet into two more-or-less equally sized halves. Or keep halving the packet $k$ times until the $ \frac1{2_k} $ sized fragment fits to the next link's MTU. The rational is that we're going to be splitting the packet anyway, and halving gives extra headroom for the next hops MTU which might be smaller.

In the worst case where each hop's MTU is exponentially decreasing, you might still get $2^N$ times the number of fragments, but in practice MTU is linearly decreasing. The reason for the linear decrease is due to the extra encapsulation header being stacked on linearly.

Revisiting the above scenario, the original packets would be divided into two of 758 bytes, both traverse through all tunnel layers without further fragmentation, and without knowing the path MTU. Alternatively, after a costly PMTUD, the host might divide the original packets to two of 1400 and 116 bytes respectively, achieving the same effect without PMTUD but with the exponential back-off fragmentation.