What is BGP, and what role did it play in Facebook’s massive outage?

On Monday, Facebook was completely knocked offline, taking Instagram and WhatsApp (not to mention a few other websites) down with it. Many have been quick to say that the incident had to do with BGP, or Border Gateway Protocol, citing sources from inside Facebook, traffic analysis, and the gut instinct that “it’s always DNS or BGP.” Facebook is back up and has since released an explanation detailing how BGP was just a part of its woes (and saying that it more or less worked as intended), but this all raises the question:

What is BGP?

At a very basic level, BGP is one of the systems the internet uses to get your traffic where it needs to go as quickly as possible. Because there are tons of different internet service providers, backbone routers, and servers involved in getting your data to, say, Facebook, there are a ton of different routes your packets could end up taking. BGP’s job is to show them the way and make sure they take the best route available.

I’ve heard BGP described as a system of post offices, an air traffic controller, and more, but I think my favorite explanation was one that likened it to a map. Imagine BGP as a bunch of people making and updating maps that show you how to get to YouTube or Facebook.

When it comes to BGP, the internet is broken up into big networks, known as autonomous systems. You can sort of imagine them as island nations — they’re networks that are controlled by a single entity, which could be an ISP, like Comcast, a company, like Facebook, or some other big organization like a government or major university. It would be extremely difficult to build bridges connecting every island to all the others, so BGP is what’s responsible for telling you which islands (or autonomous systems) you have to go through to get to your destination.

Since the internet is always changing, the maps need to be updated — you don’t want your ISP to lead you down an old road that no longer goes to Google. Because it’d be a massive undertaking to map the entire internet all the time, autonomous systems share their maps. They’ll occasionally talk to their island neighbors to see and copy any updates they’ve made to their maps.
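
To make that map-sharing idea a bit more concrete, here’s a toy Python sketch. The AS numbers and the who-peers-with-whom topology are entirely made up, and real BGP speakers exchange UPDATE messages over long-lived TCP sessions with far more policy involved; this is just the flavor of neighbors copying each other’s announcements until everyone has a route to a destination.

```python
# Toy simulation of BGP-style map sharing between neighboring autonomous
# systems (ASes). Purely illustrative: real BGP speakers exchange UPDATE
# messages over TCP sessions and apply far more policy than this.

# Hypothetical topology: which ASes peer with which (all made up).
NEIGHBORS = {
    "AS100": ["AS200"],
    "AS200": ["AS100", "AS300"],
    "AS300": ["AS200", "AS400"],
    "AS400": ["AS300"],
}

def propagate(origin_as, prefix):
    """Spread a route announcement from origin_as until no table changes."""
    # Each AS's "map": prefix -> list of ASes to traverse to reach it.
    tables = {asn: {} for asn in NEIGHBORS}
    tables[origin_as][prefix] = [origin_as]

    changed = True
    while changed:
        changed = False
        for asn, table in tables.items():
            if prefix not in table:
                continue
            path = table[prefix]
            for neighbor in NEIGHBORS[asn]:
                # Never offer a neighbor a path that already runs through it;
                # real BGP prevents loops the same way, using the AS path.
                if neighbor in path:
                    continue
                candidate = [neighbor] + path
                known = tables[neighbor].get(prefix)
                # The neighbor copies the route if it's new or shorter.
                if known is None or len(candidate) < len(known):
                    tables[neighbor][prefix] = candidate
                    changed = True
    return tables

tables = propagate("AS400", "203.0.113.0/24")
for asn in sorted(tables):
    print(asn, "reaches 203.0.113.0/24 via", " -> ".join(tables[asn]["203.0.113.0/24"]))
```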

Using maps as a framework, it’s easy to imagine how things can go wrong. Back when consumers first got access to GPS, there were always jokes about it having you drive off a cliff or into the middle of the desert. The same thing can happen with BGP — if someone makes a mistake, it can end up leading traffic somewhere it’s not supposed to go, which will cause problems. If it isn’t caught, that mistake will end up on everyone’s map. There are other ways this can go wrong, but we’ll get to those in a bit.

This is massively simplified, but here’s an example: imagine you want to connect to an imaginary tech news website called Convergence. Convergence uses the ISP NetSend, and you use DecadeConnect. In this example, DecadeConnect and NetSend can’t talk directly to each other, but your ISP can talk to Border Communications, which can talk to Form, which can talk to NetSend. If that’s the only route, BGP will make sure that you and Convergence can communicate through it. But if both DecadeConnect and NetSend are also connected to ThirdLevel, BGP would likely route your traffic through it instead, since that’s a shorter hop.
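
Sticking with those made-up providers, here’s a small sketch of the “prefer the shorter path” idea, using an ordinary breadth-first search as a stand-in for BGP’s much richer decision process:

```python
from collections import deque

# The fictional providers from the example above, and who can talk to whom.
LINKS = {
    "DecadeConnect": ["Border Communications", "ThirdLevel"],
    "Border Communications": ["DecadeConnect", "Form"],
    "Form": ["Border Communications", "NetSend"],
    "ThirdLevel": ["DecadeConnect", "NetSend"],
    "NetSend": ["Form", "ThirdLevel"],
}

def shortest_path(src, dst):
    """Breadth-first search: a stand-in for 'prefer the shortest AS path'."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in LINKS[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(" -> ".join(shortest_path("DecadeConnect", "NetSend")))
# Prints: DecadeConnect -> ThirdLevel -> NetSend (the shorter route wins)
```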

Unfortunately, it can get even more complicated, because the shortest path doesn’t always equal the best one. There are plenty of reasons why a routing algorithm would choose one path over another; cost can be a factor as well, with some networks charging others to send traffic through them.
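
As a hedged illustration of policy beating path length, here’s a simplified slice of the decision process with invented numbers. BGP compares “local preference” before it even looks at path length, so a network can steer traffic toward a cheaper or contractually preferred neighbor even when that path is longer.

```python
# A simplified slice of BGP's best-path decision, with invented values.
# Real routers compare many more attributes, in a fixed order.
candidate_routes = [
    {"as_path": ["ThirdLevel", "NetSend"], "local_pref": 100},
    {"as_path": ["Border Communications", "Form", "NetSend"], "local_pref": 200},
]

def best_route(candidates):
    # Highest local preference wins first; the shorter AS path only
    # matters as a tie-breaker further down the list.
    return max(candidates, key=lambda r: (r["local_pref"], -len(r["as_path"])))

chosen = best_route(candidate_routes)
print("Chosen path:", " -> ".join(chosen["as_path"]))
# The longer path wins here because its local preference is higher,
# e.g. because sending traffic that way is cheaper or contractually required.
```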

Also, maps are super tricky! I discovered this just recently trying to plan a trip where roads existed on one map and not another or were different between maps. One road even had three different names across three maps. If it’s that hard to pin down for a “town” that has all of five roads, imagine what it’s like trying to connect the entire internet together. Real roads don’t change that often, but websites can move from one country to another or change, add, or subtract service providers, and the internet just has to deal with it.

Figuring out the best routes is essentially a graph theory problem, and personally, I dropped out as soon as I heard about graphs.

But Facebook didn’t! In fact, it’s built its own BGP system, which lets it do “fast incremental updates,” according to a paper presented earlier this year. That said, the system the company describes there is meant for communication within data centers — at this point, it’s hard to say what caused Facebook’s problems on Monday, and it’d take someone smarter than me to say whether Facebook’s data center communications could cause this kind of issue. Cybersecurity reporter Brian Krebs claims that the outage was caused by a “routine BGP update.”

In Facebook’s engineering update, the company said that the issue was caused by “configuration changes on the backbone routers that coordinate network traffic between our data centers.” That then led to a “cascading effect on the way [Facebook’s] data centers communicate, bringing [its] services to a halt.” At least to my eye, it reads like the problem was Facebook communicating within itself, not to the outside world (though that can obviously cause a worldwide outage, given how much of its own network stack Facebook controls).


To borrow an explanation from Cloudflare: DNS tells you where you’re going, and BGP tells you how to get there. DNS is how computers know what IP address a website or other resource can be found at, but that knowledge by itself won’t get your packets there. If you ask a friend where their house is, you’re still probably going to need GPS to actually take you there.
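
Here’s a small Python sketch of that division of labor. The DNS step is real (it needs network access to resolve a name); the routing step is faked with an invented table, since your computer normally just hands packets to your ISP and lets BGP-informed routers worry about the rest.

```python
import socket
import ipaddress

# Step 1 (DNS, real): "where is it?" Turn a name into an IP address.
ip = socket.gethostbyname("example.com")
print("DNS says example.com lives at", ip)

# Step 2 (routing, faked): "how do I get there?" On a real router this
# table is built from BGP advertisements; these entries are invented.
FAKE_ROUTES = {
    ipaddress.ip_network("93.184.0.0/16"): "via ThirdLevel",
    ipaddress.ip_network("0.0.0.0/0"): "via my ISP's default route",
}

addr = ipaddress.ip_address(ip)
# The most specific matching prefix wins, just like on a real router.
match = max(
    (net for net in FAKE_ROUTES if addr in net),
    key=lambda net: net.prefixlen,
)
print("Pretend route table says: send it", FAKE_ROUTES[match])
```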

Cloudflare also has a great technical rundown of how BGP errors can also mess up DNS requests — the article is specifically about Monday’s Facebook incident, so it’s worth a read if you’re looking for an explanation of what it looked like from an autonomous system’s perspective.

Plenty of things have gone wrong with BGP before. According to Cloudflare, two notable incidents include a Turkish ISP accidentally telling the entire internet to route its traffic to its network in 2004, and a Pakistani ISP accidentally blocking YouTube worldwide in 2008 after trying to block it only for its own users. Because BGP updates spread from autonomous system to autonomous system (which, as a reminder, is one of the things that makes the protocol so darn useful), one group’s mistake can cascade across the internet.

One group getting owned can also cause problems — in 2018, hackers were able to hijack requests to Amazon’s DNS and steal thousands of dollars in Ethereum by compromising a separate ISP’s BGP servers. Amazon wasn’t the one hacked, but traffic meant for it ended up somewhere else.
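
Hijacks like these often trade on the same behavior: routers forward traffic along the most specific prefix they know about, so a bogus, more specific announcement can quietly win. A quick illustration, with invented prefixes and owners:

```python
import ipaddress

# Invented prefixes and owners, to show why a more specific
# announcement "wins": routers use the longest matching prefix.
announcements = {
    ipaddress.ip_network("198.51.100.0/22"): "the legitimate owner",
    ipaddress.ip_network("198.51.100.0/24"): "someone who shouldn't have it",
}

destination = ipaddress.ip_address("198.51.100.7")
best = max(
    (net for net in announcements if destination in net),
    key=lambda net: net.prefixlen,
)
print("Traffic for", destination, "goes to", announcements[best])
# The /24 is more specific than the /22, so packets flow to the bogus announcer.
```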

Or, you can mess up a configuration and delete your entire service off the internet with a bad BGP update. BGP is lovingly called the duct tape of the internet, but no adhesive is perfect.

So what happened to Facebook?

It turns out that BGP played a part in Facebook’s issues but wasn’t the root cause. In its detailed explanation, released on Tuesday, the company says that a command issued as part of routine maintenance accidentally disconnected all of Facebook’s data centers (oops!). When the company’s DNS servers saw that they could no longer reach those data centers over the backbone, they stopped sending out BGP advertisements, treating the loss of connectivity as a sign that something had gone wrong.

To the wider internet, this looked like Facebook telling everyone to take its servers off their maps. Cloudflare’s CTO reported that the service saw a ton of BGP updates from Facebook right before it went dark (most of which were route withdrawals, i.e., Facebook erasing the lines on the map that led to it). One of Fastly’s tech leads tweeted that Facebook stopped providing routes to Fastly when it went offline, and KrebsOnSecurity backed up the idea that it was some update to Facebook’s BGP that knocked out its services.
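
Here’s a hedged sketch of what a withdrawal does to everybody else’s map, with placeholder prefixes standing in for Facebook’s real ones: once the route is withdrawn, a router simply has nowhere left to send that traffic.

```python
import ipaddress

# A pretend router's table of learned routes. The prefixes are placeholders,
# not Facebook's real address space.
routes = {
    ipaddress.ip_network("192.0.2.0/24"): "path via SomeBigNetwork",
    ipaddress.ip_network("198.51.100.0/24"): "path via AnotherNetwork",
}

def lookup(ip):
    """Return the best route for an address, or None if nobody advertises it."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in routes if addr in net]
    return routes[max(matches, key=lambda n: n.prefixlen)] if matches else None

print("Before withdrawal:", lookup("192.0.2.10"))

# The origin withdraws its announcement: the line gets erased from the map.
del routes[ipaddress.ip_network("192.0.2.0/24")]

print("After withdrawal:", lookup("192.0.2.10"))  # None: traffic has nowhere to go
```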

I’d recommend Cloudflare’s explanation if you want nitty-gritty technical details of what it looks like when BGP goes wrong.

In summary, though, yes: Facebook’s BGP system essentially took its service off the map. However, it only did so because the company’s infrastructure was down for other reasons — the Facebook island the maps pointed to more or less no longer existed.

Fixing it wasn’t easy, either, because the outage took down Facebook’s internal networks along with everything else. Facebook detailed the difficulties it had bringing its systems back up in its Tuesday blog post, and there were reports of Facebook employees being locked out of badge-protected doors and struggling to communicate with one another. In situations like these, you not only have to figure out who has the knowledge to solve the problem and who has the permissions to solve it, but also how to connect those people. And when your entire company is essentially shut down, that’s no easy task — The Verge received reports of engineers being physically sent to a Facebook data center in California to try to fix the problem.

If the problem had been a BGP misconfiguration, Facebook would have needed to make sure that it was advertising the correct records and that those records were picked up by the internet at large. In other words, it’d need to make sure its maps were right and that everyone could see them.

Would a decentralized, blockchain-based internet have prevented this?

Probably not: even if Facebook hopped on the decentralized train, there’d still have to be some protocol telling you where to find its resources. We’ve seen that it’s possible to misconfigure or mess up blockchain contracts before, so I’d be a bit suspicious of anyone who said that a contract- and blockchain-based internet would be immune to this kind of issue.

Obviously, the fact that this all happened while a whistleblower was going on TV and airing out Facebook’s dirty laundry makes it really easy to come up with alternative explanations. But it’s just as possible that this was an innocent mistake made by some (very, very unfortunate) person on Facebook’s IT staff.

For what it’s worth, that’s Facebook’s explanation. It lays the blame on a “faulty configuration change” that it made, not any devious hacks.

Update October 4th, 10:44PM ET: Updated with information from Facebook’s official engineering post.

Update October 5th, 2:33PM ET: Updated with the explanation from Facebook’s new engineering post, which identified an incorrect command that brought down its network as the root cause of the issue and detailed its BGP system’s role in the outage.
