Monday, May 20, 2013

Exponential back-off algorithm for IP packet fragmentation

On reading about Path MTU discovery, I found this article Never Ending Story of IP Fragmentation. It explains the problem with bad IP fragmentation when there are multiple encapsulations. Long story short, network links can only carry packets of a certain size called "maximum transmission unit" or MTU. Large packets are split into fragments to fit to the MTU. But MTU is reduced when tunneling IP traffic, e.g. GRE (used by VPN like PPTP and IPsec) and PPPoE. If we naively split a packet by next-hop MTU, then we might get this situation:

  • Original packet has 1516 bytes to traverse over an Ethernet link with MTU of 1500. It is split into two packets, fragment 1 of 1500 bytes and fragment 2 of 36 bytes (the excess 16 bytes plus additional 20 bytes IP header).
  • Fragment 1 traverses through a PPPoE tunnel with MTU of 1492 bytes. It is split into two more packets, fragment 1.1 of 1492 bytes and fragment 1.2 of 36 bytes (8 bytes excess + 28 bytes overhead). However, fragment 2 traverses unaffected but now becomes 44 bytes.
  • Fragment 1.1 traverses through a IPsec tunnel with MTU of 1400 bytes. It is split into two more fragments, fragment 1.1.1 of 1400 bytes and fragment 1.1.2 of 152 bytes (the excess 92 bytes plus 60 bytes encapsulation overhead). The other fragment 1.2 of 36 bytes traverses unaffected but becomes 96 bytes after encapsulation; fragment 2 also traverses unaffected but also becomes 96 bytes.
  • Fragment 1.1.1 traverses through another tunnel...
If there are N tunnel encapsulations, for each packet we might get N fragments. This is a blow-up of N factor in the total number of packets sent. This incurs cost both in packet switching and assembly.

The industry standard practice is "Path MTU Discovery" or PMTUD. It sends a series of "do not fragment" packets down the pipe until it gets rejected, and the hop that rejects it advertises the MTU. The originator then reduces MTU and tries again. The sole purpose is that the originator of the packet can figure out how much data to send in order to fit the smallest MTU of all the hops until it reaches the destination, at the cost of large connection startup time. Each MTU discovery incurs a partial round-trip cost, so the total round trip time is \( O(n^2) \) where n is the number of hops. This is especially costly for servers that deal with a large number of clients. The server would have to probe and remember each client's path MTU.

What I'm proposing is to get rid of PMTUD altogether but apply the principle of exponential back-off in fragmentation.

Every time a packet has to be fragmented, divide the packet into two more-or-less equally sized halves. Or keep halving the packet \(k\) times until the \( \frac1{2_k} \) sized fragment fits to the next link's MTU. The rational is that we're going to be splitting the packet anyway, and halving gives extra headroom for the next hops MTU which might be smaller.

In the worst case where each hop's MTU is exponentially decreasing, you might still get \(2^N\) times the number of fragments, but in practice MTU is linearly decreasing. The reason for the linear decrease is due to the extra encapsulation header being stacked on linearly.

Revisiting the above scenario, the original packets would be divided into two of 758 bytes, both traverse through all tunnel layers without further fragmentation, and without knowing the path MTU. Alternatively, after a costly PMTUD, the host might divide the original packets to two of 1400 and 116 bytes respectively, achieving the same effect without PMTUD but with the exponential back-off fragmentation.

No comments: