Date: 2021-10-05

Facebook’s family of products, including Instagram, WhatsApp, and Messenger, started coming back online early Tuesday morning (IST) after being down for nearly six hours in an unprecedented outage affecting billions of users.

The outage, which started around 9 pm IST on Monday, affected not only Facebook’s own products and users but also websites and apps that use Facebook services such as ads and authentication (Login with Facebook).

According to outage-tracking site Downdetector, this was the largest outage the company has seen, with over 14 million problem reports from all over the globe.

What caused the outage?

Facebook has yet to publish a detailed account of what went wrong, but in a short blog post the company said the root cause was a faulty configuration change on the backbone routers that coordinate network traffic between its data centers. “This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt,” the company said. It added that there is no evidence that user data was compromised as a result of the downtime.

In a more detailed explanation, CDN provider Cloudflare said the outage can be traced back to a problem with the Border Gateway Protocol (BGP), a mechanism for exchanging routing information between the different networks that make up the internet.

“The Internet is literally a network of networks, and it’s bound together by BGP. BGP allows one network (say Facebook) to advertise its presence to other networks that form the Internet,” Cloudflare explained. “Without BGP, the Internet routers wouldn’t know what to do, and the Internet wouldn’t work.”

Facebook’s services went down because the company’s network stopped advertising its presence, leaving ISPs and other networks unable to find it, Cloudflare wrote.
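To make the idea of advertising and withdrawing a route concrete, here is a minimal sketch of a toy routing table. It is not how real BGP routers are implemented. The class name, the origin label, and the prefix (taken from the address range reserved for documentation) are all hypothetical and chosen only for illustration.

```python
import ipaddress


class RoutingTable:
    """Toy routing table: maps announced prefixes to the network that announced them."""

    def __init__(self):
        self.routes = {}  # ip_network -> origin label

    def announce(self, prefix, origin):
        """Record that `origin` advertises reachability for `prefix`."""
        self.routes[ipaddress.ip_network(prefix)] = origin

    def withdraw(self, prefix):
        """Remove a previously announced prefix, if present."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin of the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
# Hypothetical prefix and label; 192.0.2.0/24 is reserved for documentation.
table.announce("192.0.2.0/24", "example-social-network")
before = table.lookup("192.0.2.53")  # a route exists, so traffic has somewhere to go

table.withdraw("192.0.2.0/24")
after = table.lookup("192.0.2.53")   # no route: other networks cannot find the address
```

Once the prefix is withdrawn, the lookup returns nothing at all, which is roughly the position other networks were in when Facebook’s routes disappeared.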

Well, did DNS have anything to do with this outage?

Problems with the Domain Name System (DNS) are a factor in many internet outages, and this one is no different. DNS converts human-readable addresses such as facebook.com into the machine-readable IP addresses where those websites actually live. DNS resolvers fetch these IP addresses from a domain’s name servers, which are typically hosted by the entity that owns the domain.

In this specific case, the faulty configuration led Facebook to withdraw the BGP routes covering the IP addresses of its DNS name servers. As a consequence, DNS resolvers around the globe could no longer resolve Facebook domain names, and anyone trying to access the site did not know where to go, Cloudflare said.
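The dependency described here, where a name can only be resolved if its name servers are reachable, can be sketched in a few lines. This is a simplified model under stated assumptions, not a real resolver: the domain, the addresses (all from ranges reserved for documentation), and the function names are hypothetical. "SERVFAIL" is the standard DNS response code for a server failure.

```python
import ipaddress

# Hypothetical zone data, for illustration only.
NAME_SERVERS = {"social.example": ["192.0.2.53", "192.0.2.54"]}
ZONE_RECORDS = {"social.example": "198.51.100.10"}


def reachable(address, announced_prefixes):
    """True if some announced prefix covers the address."""
    addr = ipaddress.ip_address(address)
    return any(addr in ipaddress.ip_network(p) for p in announced_prefixes)


def resolve(domain, announced_prefixes):
    """Return the domain's IP if any of its name servers can be reached."""
    for name_server in NAME_SERVERS.get(domain, []):
        if reachable(name_server, announced_prefixes):
            return ZONE_RECORDS[domain]
    # No name server could be contacted, so the resolver cannot answer.
    return "SERVFAIL"


# Normal operation: the prefix holding the name servers is announced.
healthy = resolve("social.example", ["192.0.2.0/24", "198.51.100.0/24"])

# Outage: the routes are withdrawn, so no name server is reachable.
outage = resolve("social.example", [])
```

Note that the website’s own address never changes in this model. The lookup fails purely because the path to the name servers has vanished.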

“In simpler terms, sometime this morning Facebook took away the map telling the world’s computers how to find its various online properties. As a result, when one types Facebook.com into a web browser, the browser has no idea where to find Facebook.com, and so returns an error page.” – Doug Madory, director of internet analysis at Kentik

“But that’s not all. Now human behavior and application logic kicks in and causes another exponential effect. A tsunami of additional DNS traffic follows,” Cloudflare further wrote. “This happened in part because apps won’t accept an error for an answer and start retrying, sometimes aggressively, and in part because end-users also won’t take an error for an answer and start reloading the pages, or killing and relaunching their apps, sometimes also aggressively.” As DNS resolvers started getting overwhelmed, Facebook’s failure started causing unintended side-effects to the rest of the internet.
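A rough back-of-the-envelope sketch shows why aggressive retries multiply load. It counts how many lookups a single failing client sends in one minute when it retries every second, and compares that with exponential backoff. Backoff is a common mitigation used here only as a point of contrast; the source does not mention it. The function names and the numbers are assumptions for illustration.

```python
def attempts_fixed(window_s, interval_s):
    """Attempts made within the window when retrying every interval_s seconds.

    The first attempt happens at t=0.
    """
    return int(window_s // interval_s) + 1


def attempts_backoff(window_s, base_s, cap_s):
    """Attempts made when the wait doubles after each failure, up to cap_s."""
    elapsed, wait, count = 0, base_s, 0
    while elapsed <= window_s:
        count += 1                    # one more failed lookup
        elapsed += wait               # wait before trying again
        wait = min(wait * 2, cap_s)   # double the wait, bounded by the cap
    return count


aggressive = attempts_fixed(60, 1)        # retry every second for a minute
patient = attempts_backoff(60, 1, 32)     # waits of 1, 2, 4, 8, 16, 32 seconds
```

Under these assumptions the aggressive client sends 61 lookups in a minute against 6 for the backing-off client. Multiplied across billions of users and apps, that gap is the kind of traffic surge Cloudflare describes.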


What took Facebook so long to fix the issue?

Outages that bring down large swathes of the internet are uncommon, but they still happen. In July, an outage at content delivery network (CDN) provider Akamai affected popular sites like Amazon, Airbnb, Swiggy, Microsoft, Paytm, and Times of India. In June, Fastly, another CDN, took a hit that affected Reddit, Spotify, Shopify, The New York Times, and BBC, among others. But in both these cases, the outage lasted for about an hour, unlike the Facebook outage, which took the company’s engineers nearly six hours to fix.

Notably, the outage cut Facebook employees off from internal communication tools and from physical access to the company’s buildings, severely hindering the resolution process.

21/ In theory, fixing this BGP problem should be a quick fix. But when your entire infrastructure is interdependent on itself, then there's a lot of impediments to fixing the core problem. — Robᵉʳᵗ Graham #PcapsOrItDidntHappen (@ErrataRob) October 4, 2021

Renewed calls for the breakup of Facebook

Facebook, WhatsApp, and Instagram all going down at the same time sure seems like an easily-understandable and publicly-popular example of why breaking up a certain monopoly into at least three pieces might not be a bad idea. Somebody should tell Elizabeth Warren. — Edward Snowden (@Snowden) October 4, 2021

If Facebook’s monopolistic behavior was checked back when it should’ve been (perhaps around the time it started acquiring competitors like Instagram), the continents of people who depend on WhatsApp & IG for either communication or commerce would be fine right now. Break them up. — Alexandria Ocasio-Cortez (@AOC) October 4, 2021

“Maybe one billionaire with a penchant for destroying democracies shouldn’t be allowed to own so much of the internet and maybe that’s why antitrust laws exist that officials who do not take lobbyist money from said billionaire-owned interests should enforce,” US Congresswoman Alexandria Ocasio-Cortez said on Instagram.

“London-based internet monitoring firm Netblocks noted that Facebook’s plans to merge its platforms — announced in 2019 — had raised concerns about the risks of such a move. While such centralization “gives the company a unified view of users’ internet usage habits,” it also makes the services vulnerable to single points of failure, Netblocks said.” — AP News

Ironically, the outage, which affected over 3 billion users, came on the same day that Facebook asked a federal judge to dismiss an antitrust complaint by the Federal Trade Commission, arguing that it faces vigorous competition from other services.

Twitter has a field day

Meanwhile, Twitter had a hell of a day as it was the only major social media platform still functioning.

hello literally everyone — Twitter (@Twitter) October 4, 2021

BREAKING: Twitter records highest number of people on its platform at the one time — The Spectator Index (@spectatorindex) October 4, 2021

