Tuesday, October 5, 2021

Four Lessons Learned from Facebook BGP Outage

First of all, a disclaimer: I don’t work at Facebook, but enough is known about the outage that I think we can all learn a few lessons from it. Consider it a cautionary tale on network design.

On Monday, Oct 4, 2021, Facebook literally unfriended everyone on the Internet. That’s because they broke their BGP peering sessions and withdrew all BGP routes to their network, so their network became unreachable. CloudFlare has explained how BGP works in great detail, so I don’t need to repeat it here.

The most immediate effect was that Facebook’s name servers became unreachable, which is what most people noticed first, because name resolution is the first thing a web browser does when you visit a website. At the time of writing, Facebook advertises four name servers.

$ host -t ns facebook.com
facebook.com name server d.ns.facebook.com.
facebook.com name server a.ns.facebook.com.
facebook.com name server b.ns.facebook.com.
facebook.com name server c.ns.facebook.com.
$ host a.ns.facebook.com
a.ns.facebook.com has address
a.ns.facebook.com has IPv6 address 2a03:2880:f0fc:c:face:b00c:0:35
$ host b.ns.facebook.com
b.ns.facebook.com has address
b.ns.facebook.com has IPv6 address 2a03:2880:f0fd:c:face:b00c:0:35
$ host c.ns.facebook.com
c.ns.facebook.com has address
c.ns.facebook.com has IPv6 address 2a03:2880:f1fc:c:face:b00c:0:35
$ host d.ns.facebook.com
d.ns.facebook.com has address
d.ns.facebook.com has IPv6 address 2a03:2880:f1fd:c:face:b00c:0:35

To be fair, these four addresses are probably anycast addresses backed by several distributed clusters of physical servers. Unfortunately, all of them are announced from the same Autonomous System, AS 32934. Even though the network behind them is distributed, the AS defines the scope of BGP peering, so it becomes a single point of failure when BGP is misconfigured.

Lesson 1: If you have multiple name servers, they should be distributed across different Autonomous Systems if feasible. Most organizations can’t do this, but one at the scale of Facebook should obtain multiple AS numbers.
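As a quick sanity check of this lesson, you could map each name server address to its origin AS and flag setups where everything sits in one AS. The sketch below uses a hypothetical hard-coded lookup table; in practice you would query a BGP data source (a whois-based IP-to-ASN service, for instance) to fill it in:

```python
# Sketch: flag DNS setups whose name servers all originate from one AS.
# The ASN table is hypothetical hard-coded data; a real check would
# resolve each IP's origin AS from BGP routing data.

def origin_as(ip, asn_table):
    """Return the origin AS for an IP, per the provided lookup table."""
    return asn_table[ip]

def as_diversity(ns_ips, asn_table):
    """Return the set of distinct origin ASes behind a list of name servers."""
    return {origin_as(ip, asn_table) for ip in ns_ips}

# Hypothetical data: four name servers, all announced from AS 32934.
ASN_TABLE = {
    "2a03:2880:f0fc:c:face:b00c:0:35": 32934,
    "2a03:2880:f0fd:c:face:b00c:0:35": 32934,
    "2a03:2880:f1fc:c:face:b00c:0:35": 32934,
    "2a03:2880:f1fd:c:face:b00c:0:35": 32934,
}

if __name__ == "__main__":
    ases = as_diversity(ASN_TABLE.keys(), ASN_TABLE)
    if len(ases) < 2:
        print(f"warning: all name servers originate from one AS: {ases}")
```

With Facebook’s actual setup, such a check would have flagged AS 32934 as the single point of failure long before the outage.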

As a side note, it is common for a big company like Facebook to have just one AS spanning multiple continents. I don’t think that’s a good idea. If a user in Europe wants to visit your server located in Asia, their traffic will enter your peering point in Europe first, which means you have to do the internal routing to Asia yourself. That’s great if you build your own inter-continental backbone and want complete control over end-user latency, but not so great if you experience a traffic surge and want to shed traffic to other backbone providers. It’s better to have one AS per continent for locality reasons, plus an extra AS or two for global anycast (e.g. for CDN). This gives you greater flexibility in inter-continental traffic engineering. I should probably write another blog post about this some other time.

I also found out that Facebook’s BGP peering is almost fully automated. At the company where I work, we received a notice from Facebook exactly two weeks ago. It said that an idle session had been detected, and that “if your session does not establish within 2 weeks, it will be removed.” Two weeks after the notice coincides exactly with this Facebook outage.

Having looked at this in more detail, I’m skeptical that this message alone could explain the outage. I looked up the IPv6 addresses used for peering and found that the prefix 2001:7f8:64:225::/64 belongs to Telekom Indonesia (AS 7713), so I believe this address was allocated for a datacenter in Indonesia used only for peering in that region. If so, the message would explain a regional outage but not a worldwide one.

But if this is any indication: the notice had been viewed only 4 times in a whole company with thousands of people doing network operations. Alerts like this are often fired and then ignored.

Lesson 2: Do not ignore automated alerts for a production system.

One thing is certain: Facebook has an automated system to clean up idle peering sessions. I recently encountered a similar issue in a project at my work, where some new code was cleaning up idle sessions too aggressively and caused connectivity issues. Fortunately it was discovered in a lab, so no harm was done.

The official explanation of the outage, according to Facebook:

During one of these routine maintenance jobs, a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.

In Facebook’s case, their software might have erroneously detected all BGP sessions as idle and then promptly deleted them. This could have been prevented if someone had reviewed the effective changes (“diffs”) the command would cause, rather than auditing the command alone; that would surface problems before the command runs in production. In a production system, it pays to be more careful.
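One way to make that idea concrete is a blast-radius guard: compute the set of sessions a cleanup run would delete (the diff), and refuse to proceed when the change touches an implausibly large fraction of them. This is only a sketch under my own assumptions; the function names and thresholds are hypothetical, not Facebook’s tooling:

```python
# Sketch: audit the *effect* of a cleanup command, not just the command.
# Names and thresholds are hypothetical; the idea is a blast-radius guard
# that refuses to proceed when a "routine" cleanup would remove too much.

def plan_cleanup(sessions):
    """Return the sessions a cleanup run would delete (the 'diff')."""
    return [s for s in sessions if s["state"] == "idle"]

def audit_diff(sessions, to_delete, max_fraction=0.2):
    """Reject any change that touches more than max_fraction of sessions."""
    if sessions and len(to_delete) / len(sessions) > max_fraction:
        raise RuntimeError(
            f"refusing: would delete {len(to_delete)}/{len(sessions)} sessions"
        )
    return to_delete

sessions = [
    {"peer": "peer-a", "state": "idle"},
    {"peer": "peer-b", "state": "established"},
    {"peer": "peer-c", "state": "established"},
]

# A normal run deletes one idle session out of three and passes the audit.
approved = audit_diff(sessions, plan_cleanup(sessions), max_fraction=0.5)

# A buggy detector that marks *every* session idle is caught by the audit.
buggy = [dict(s, state="idle") for s in sessions]
try:
    audit_diff(buggy, plan_cleanup(buggy), max_fraction=0.5)
except RuntimeError as e:
    print(e)
```

The point of auditing the diff rather than the command is that a command can look perfectly routine while its effect is catastrophic, which is exactly what Facebook’s own post-mortem describes.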

Lesson 3: The effects of production changes should be manually reviewed and signed off before execution.

Last but not least, BGP is not the only way to define network routes. BGP was designed to find the least costly path over multiple hops of the network, but before BGP, network engineers simply configured static routes. The routing table for the entire Internet is now too large (~900K entries at the time of writing; see the CIDR Report for the current size) to be manually configured. Facebook alone advertises around 160 IPv4 prefixes, but they would only need a few well-known minimal routes as backup to keep basic connectivity. Physical network changes happen far less often, so keeping the static routing table in sync is just a small overhead at Facebook’s scale.
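The fallback logic is simple to state: prefer the dynamically learned table, but when it comes up empty, fall back to a small pinned table covering your critical prefixes (your name servers, at minimum). A minimal sketch, with hypothetical prefixes and next hops:

```python
# Sketch: keep a tiny pinned routing table for critical prefixes (e.g. the
# prefixes covering your name servers) as a fallback for when the dynamic
# table is withdrawn. Prefixes and next hops here are hypothetical.

STATIC_BACKUP = {
    "2a03:2880:f0fc::/48": "backbone-gw-1",
    "2a03:2880:f0fd::/48": "backbone-gw-2",
}

def effective_routes(dynamic, static_backup=STATIC_BACKUP):
    """Use the dynamic table when present; otherwise fall back to the
    pinned minimal routes so basic connectivity survives a withdrawal."""
    return dynamic if dynamic else dict(static_backup)

# Normal operation: the full dynamically learned table wins.
full = {"2a03:2880::/32": "bgp-learned"}
assert effective_routes(full) == full

# After a faulty command withdraws everything, the backup keeps the
# critical prefixes reachable.
assert effective_routes({}) == STATIC_BACKUP
```

Had something like this been in place, Facebook’s name servers could have remained reachable even with every dynamically computed route withdrawn.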

Lesson 4: Always have a minimal backup configuration when the main configuration is produced by an automated algorithm.

That’s all, folks!

Some wisecrack post-scriptum: I’m a programming languages person by training, but somehow over the years I’ve amassed enough networking knowledge to write convincingly about it.

I may have spent more time making the cover graphics than writing, to be honest. Give me a thumbs up if you like the graphic, even if you couldn’t care less about the content.

The opinions expressed here do not reflect those of my employer, not that I’ve made it easy to tell who that is either (hopefully).

Thursday, July 15, 2021

Revival of Harvard Architecture for the Mitigation of Ransomware Attacks

Pictured below is the Gropius House in Lincoln, Massachusetts, the former residence designed and built by Walter Gropius, renowned architect of the Bauhaus movement and emeritus chair of the Department of Architecture at the Harvard Graduate School of Design.

But the Harvard architecture revival I'm writing about is not that kind of architecture. It is a computer architecture from the punched-tape era that kept program and data in two separate pathways. The Harvard Mark I machine was designed by Howard Aiken at Harvard University. He took inspiration from the Analytical Engine, the mechanical computer designed by Charles Babbage, who was the mathematics mentor of Ada Lovelace, the first computer programmer and the daughter of the poet Lord Byron.

There is poetic justice about the fact that computer programming came out of the conjunction of mathematics and poetry.

What makes Harvard architecture computers notably different from today's computers is that programs were read-only, and data could not be executed as a program. In today's computers, programs and data are stored on the same medium. Anyone who can put data on a computer can also run it as a program.

Security breaches happen in two steps: (1) someone gains write access to a data path, and (2) subsequently leverages that data path to upload and execute unauthorized programs on a computer they do not own. Gaining write access for step (1) is not that hard: many computer systems that handle data actively solicit it from non-privileged users. Step (2) relies on knowing a vulnerability in a running program that can trick it into passing control flow to arbitrary data.

A case in point is WordPress, a blogging platform notorious for vulnerabilities that result in security breaches. That's partly because it is built on PHP, a language that makes it easy to run data as a program:

  • Once file upload is enabled on a site, PHP allows it for any script on that site, even if that particular script never expects file uploads.
  • A moderately complex site breaks its code into several files, and the first file include()s or require()s other files to bootstrap the program's functions. The file to be included is often a computed path that could point to any file on the filesystem.
    • PHP can be configured to limit what can be included using open_basedir. This limits exfiltration of system files outside the public site, such as /etc/passwd, but uploaded files are often moved under open_basedir, where they can then be included.
  • Any file type can be included, even an image file with PHP code embedded in it.

An attacker only needs to upload a malicious image to WordPress and trick it into running the image. It's less of an issue for a private WordPress installation, but on a shared blogging site like wordpress.com, where anyone can register an account, anyone can cause a security breach and steal other people's passwords. That's why people shouldn't reuse passwords across multiple sites.
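The computed-include hazard is not unique to PHP. The sketch below illustrates the same pattern in Python (WordPress itself is PHP; this is only an analogy): executing whatever file a user-influenced path points to is equivalent to running uploaded data as a program, and an allowlist of known include targets is one simple defense. The file names here are hypothetical:

```python
# Analogy in Python (WordPress is PHP; this only illustrates the pattern):
# running whatever file a computed, user-influenced path points to is
# equivalent to executing uploaded data as a program.

ALLOWED_INCLUDES = {"header.py", "footer.py"}   # hypothetical allowlist

def unsafe_include(path):
    """Dangerous: executes whatever the computed path points to."""
    with open(path, "rb") as f:
        exec(compile(f.read(), path, "exec"), {})

def safe_include(name):
    """Safer: only files on a fixed allowlist may ever be executed."""
    if name not in ALLOWED_INCLUDES:
        raise PermissionError(f"refusing to include {name!r}")
    unsafe_include(name)

try:
    safe_include("uploads/cat.jpg")   # attacker-chosen path is rejected
except PermissionError as e:
    print(e)
```

The allowlist works because it turns "any file on the filesystem" back into a small, reviewed set of program files, which is the Harvard-architecture instinct applied in software.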

Once attackers hijack the computer, they also hijack whatever data the computer was entrusted to handle. They may also modify existing legitimate programs on the computer and hide illicit code there, which makes it very costly to discover and clean up after a security incident. This is how ransomware works in a nutshell: some crooks turned security breaches into a profiteering criminal enterprise.

If today's computers had separate program and data pathways, a security breach would be limited to the original unauthorized data access and would not escalate into a complete hijack of the computer.

Modern computers provide some facilities to separate code and data, but they are optional and must be deployed judiciously in a production environment in order to be effective.

  • Executable space protection designates parts of memory for data only and prohibits executing them as code. This protects programs that are already running from remote code injection vulnerabilities.
  • Enforcing writable data to be non-executable (i.e. W^X). This is typically part of executable space protection for running programs, but the same principle can be applied to programs and data stored on disk.
  • Copy-on-write is a storage policy that maintains snapshots of data modifications. It can make it easier to restore the filesystem to its state before a security breach.
  • Some operating systems can make a filesystem read-only, e.g. read-only bind mounts.
  • Some filesystems can make files non-executable on a volume, e.g. zfs exec=off or the noexec mount option.
  • Code signing verifies that a program comes from a trusted party and has not been modified in an unauthorized manner. Verification and execution must happen atomically; otherwise an attacker has a window between verification and execution to modify the program and evade detection.
They all work towards the same principle as the Harvard architecture, where programs are read-only and writable data cannot be executed as a program.
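The atomicity point about code signing can be sketched directly: read the program bytes once, verify them against a trusted digest, and execute the verified in-memory copy, never re-reading the file (which an attacker could swap between check and run). This is a minimal illustration, not a production code-signing scheme; real systems use asymmetric signatures rather than a bare hash:

```python
# Sketch of verify-then-execute atomicity: check the bytes you will run,
# then run exactly those bytes, leaving no window to swap the file on disk.

import hashlib

def run_verified(program_bytes, trusted_sha256):
    """Execute program text only if it matches the trusted digest."""
    digest = hashlib.sha256(program_bytes).hexdigest()
    if digest != trusted_sha256:
        raise PermissionError("program does not match trusted digest")
    # Execute the in-memory bytes we just verified -- not the file on disk.
    exec(compile(program_bytes, "<verified>", "exec"), {})

program = b"print('hello from a verified program')"
trusted = hashlib.sha256(program).hexdigest()

run_verified(program, trusted)               # verified copy runs
try:
    run_verified(program + b"  # tampered", trusted)
except PermissionError as e:
    print(e)                                 # tampered copy is rejected
```

A verify-then-reopen sequence, by contrast, is exactly the time-of-check/time-of-use gap the code-signing bullet warns about.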

Ultimately, the reason programs must be read-only has to do with accountability requirements imposed by the Sarbanes-Oxley Act in response to corporate scandals such as Enron. Programs running in production must be able to establish their provenance, showing that someone has reviewed and certified that the program serves its intended purpose. If the program is writable after certification, its provenance is in doubt.

In a sense, if a company falls victim to a ransomware attack, there must be a loophole in the accountability of its information systems. A revival of the Harvard architecture would close that loophole and eliminate ransomware attacks.