Tuesday, October 5, 2021

Four Lessons Learned from Facebook BGP Outage

First of all, a disclaimer: I don’t work at Facebook, but enough is known about the outage that I think we can all learn a few lessons from it. Consider it a cautionary tale about network design.

On Monday, Oct 4, 2021, Facebook literally unfriended everyone on the Internet: they broke their BGP peering sessions and withdrew all BGP routes to their network, so their network became unreachable. Cloudflare has explained how BGP works in great detail, so I don’t need to repeat it here.

The most immediate effect is that Facebook’s name servers became unreachable, which is what most people noticed first, because a DNS lookup is the first thing your browser (through its resolver) triggers when you visit a website. At the time of writing, Facebook advertises 4 name servers:

$ host -t ns facebook.com
facebook.com name server d.ns.facebook.com.
facebook.com name server a.ns.facebook.com.
facebook.com name server b.ns.facebook.com.
facebook.com name server c.ns.facebook.com.
$ host a.ns.facebook.com
a.ns.facebook.com has address 129.134.30.12
a.ns.facebook.com has IPv6 address 2a03:2880:f0fc:c:face:b00c:0:35
$ host b.ns.facebook.com
b.ns.facebook.com has address 129.134.31.12
b.ns.facebook.com has IPv6 address 2a03:2880:f0fd:c:face:b00c:0:35
$ host c.ns.facebook.com
c.ns.facebook.com has address 185.89.218.12
c.ns.facebook.com has IPv6 address 2a03:2880:f1fc:c:face:b00c:0:35
$ host d.ns.facebook.com
d.ns.facebook.com has address 185.89.219.12
d.ns.facebook.com has IPv6 address 2a03:2880:f1fd:c:face:b00c:0:35

To be fair, these four IP addresses are probably anycast addresses backed by several distributed clusters of physical servers. Unfortunately, all of them are in the same Autonomous System, AS 32934. Even though the servers themselves are distributed, all of their routes are announced from that one AS, so a single BGP misconfiguration in it becomes a single point of failure for all four name servers.

Lesson 1: If you have multiple name servers, they should be distributed across different Autonomous Systems if feasible. Most people can’t do it, but an organization at the scale of Facebook should get multiple AS numbers.
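
If you want to verify the AS yourself, Team Cymru runs a public IP-to-ASN mapping service that you can query over whois; running it against each of the four addresses maps them all back to AS 32934 (I’m omitting the output here, which includes the origin AS, the covering prefix, and the AS name):

$ # the leading " -v " asks Team Cymru's whois for verbose (column) output
$ whois -h whois.cymru.com " -v 129.134.30.12"
$ whois -h whois.cymru.com " -v 129.134.31.12"
$ whois -h whois.cymru.com " -v 185.89.218.12"
$ whois -h whois.cymru.com " -v 185.89.219.12"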

As a side note, it is common for a big company like Facebook to have just one AS spanning multiple continents. I don’t think that’s a good idea. If a user in Europe wants to visit your server located in Asia, their traffic will enter your peering point in Europe first, which means you have to do the internal routing to Asia yourself. That’s great if you’ve built your own intercontinental backbone and want complete control over end-user latency, but not so great if you experience a traffic surge and want to shed traffic onto other backbone providers. It’s better to have one AS per continent for locality reasons, plus an extra AS or two for global anycast (e.g. for a CDN). This gives you greater flexibility in intercontinental traffic engineering. I should probably write another blog post about this some other time.

I also found out that Facebook’s BGP peering is almost fully automated. At the company where I work, we received a notice from Facebook exactly two weeks ago. It said that an idle session had been detected, and that “if your session does not establish within 2 weeks, it will be removed.” Two weeks after that notice lands exactly on the day of this Facebook outage.

Having looked at this in more detail, I’m skeptical that this message alone could explain the outage. I looked up the IPv6 addresses used for peering, and found that the prefix 2001:7f8:64:225::/64 belonged to Telekom Indonesia (AS 7713), so I believe this address was allocated for a datacenter in Indonesia and only used for peering in that region. If so, this message would explain at most a regional outage, not a worldwide one.
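
(The lookup itself is nothing fancy; a plain whois query on an address in the prefix, and then on the AS number, gives roughly the same picture:)

$ whois 2001:7f8:64:225::        # who holds the peering prefix?
$ whois AS7713                   # and whose AS is that?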

But if this is any indication of how such notices are treated, note that the message was viewed only 4 times in a whole company with thousands of people doing network operations. Alerts like this are often fired and then ignored.

Lesson 2: Do not ignore automated alerts for a production system.

One thing that is certain is that Facebook has an automated system to clean up idle peering sessions. I recently encountered a similar issue in a project at work, where some new code cleaned up idle sessions too aggressively and caused connectivity issues. It was discovered in a lab, so no harm was done.

The official reason for the outage, according to Facebook:

During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.

In Facebook’s case, their software might have erroneously detected all BGP sessions as idle and then promptly deleted them. This could be prevented by looking at the effective changes (“diffs”) that a command would cause, not just auditing the command itself; reviewing the diff makes it possible to catch problems before the command runs in production. In a production system, it pays to be more careful.

Lesson 3: The effect of production changes should be manually reviewed and signed off before execution.
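
To make Lesson 3 concrete, here is a minimal sketch of the kind of check I have in mind, assuming (purely for illustration) a cleanup tool that can do a dry run and write out the sessions it intends to keep; the file names and the 10% threshold are made up:

$ cat review-session-cleanup.sh
#!/bin/bash
# Hypothetical guardrail: diff the current peering sessions against the set that
# would survive a dry run of the cleanup tool, and demand a human sign-off if the
# change is suspiciously large. File names and the 10% threshold are illustrative.
current=sessions-current.txt        # one session identifier per line
after=sessions-after-dryrun.txt     # sessions the cleanup tool says it would keep

total=$(( $(wc -l < "$current") ))
# Sessions present now but absent after the dry run, i.e. what would be deleted.
comm -23 <(sort "$current") <(sort "$after") > sessions-to-remove.txt
removed=$(( $(wc -l < sessions-to-remove.txt) ))

echo "Cleanup would remove $removed of $total sessions (see sessions-to-remove.txt)."

# Anything bigger than 10% of all sessions is too large to apply unattended.
if [ "$removed" -gt $(( total / 10 )) ]; then
    echo "Refusing to auto-apply: change exceeds 10% of current sessions." >&2
    exit 1
fi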

Last but not least, BGP is not the only way to define network routes. BGP was designed to find the least costly path over multiple hops of the network, but before BGP, network engineers simply configured static routes. The routing table for the entire Internet is now far too large to maintain by hand (~900K entries at the time of writing; see the CIDR Report for the current size). Facebook alone advertises around 160 IPv4 prefixes, but they would only need a handful of well-known static routes as a backup to keep basic connectivity. Compared to physical network changes, which take significantly more effort, maintaining such a small static routing table is negligible overhead at Facebook’s scale.

Lesson 4: Always keep a minimal backup configuration when the real configuration is computed by an automated algorithm.
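
As a sketch of what that minimal backup could look like on an ordinary Linux router (the prefixes and next hops below are documentation addresses, purely for illustration; the real thing would live in the routers’ startup configuration):

$ # Hypothetical static fallback routes with a deliberately bad metric, so that
$ # routes learned via BGP are preferred whenever they are present.
$ sudo ip route add 192.0.2.0/24 via 198.51.100.1 metric 500
$ sudo ip -6 route add 2001:db8:35::/48 via 2001:db8:ffff::1 metric 500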

That’s all, folks!

Some wisecrack post-scriptum: I’m a programming languages person by training, but somehow over the years I’ve amassed enough networking knowledge to write convincingly about it.

I may have spent more time making the cover graphics than writing, to be honest. Give me a thumbs up if you like the graphic, even if you couldn’t care less about the content.

The opinion expressed does not reflect that of my employer, not that I’ve made it easy to tell who that is either (hopefully).
