Around 3 pm Eastern Time on June 2 (12.30 am IST on June 3), YouTube, Snapchat, Gmail, Nest, Discord and a number of other web services that rely on the Google Cloud Platform suffered major outages in the US. It took the company more than four hours to resolve the problem.
Who was affected?
Google Cloud’s official blog reported that YouTube saw a 10% drop in global views during the outage, and Google Cloud Storage saw a 30% reduction in traffic. Only 1% of active Gmail users had problems, while Google Search, a low-bandwidth service, was barely affected.
Not only Google’s own properties but almost all services that run on Google Cloud were affected on the east coast of the US. Some YouTube and Gmail users in Europe also reported problems. Google’s G Suite Status Dashboard showed practically every Google web service was down.
Snapchat and Vimeo users were affected too, as both services use Google Cloud on the backend. Shopify, a Canadian e-commerce firm that also supplies retail point-of-sale systems, went down as well. As a result, the physical and online stores that depend on it could not process credit card payments.
A Nest user on Y Combinator’s Hacker News forum reported that they could not let guests into their house. (Letting people in physically is, of course, still possible, but remote access to a property is disabled when the cloud service behind it goes down.)
9to5Mac reported that many of Apple’s iCloud services also experienced problems. Apps were running slower than expected but were not completely offline, and Apple’s System Status page reflected that. Apple had confirmed last year that it uses Google Cloud Platform for iCloud storage, though Google has no access to user data.
What was the reason?
On its official blog, Google said the cause of the disruption was ‘a configuration change that was intended for a small number of servers in a single region’. However, the configuration was incorrectly applied to a larger number of servers across several neighbouring regions. As a result, those regions stopped using more than half of their available network capacity.
The network traffic to and from those regions then tried to fit into the remaining network capacity, essentially overloading it. Google’s networking systems ‘correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows’.
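The triage behaviour described above can be illustrated with a minimal sketch. It is not Google’s implementation: the `Flow` class, the `triage` function, the flow names and the bandwidth figures are all hypothetical, chosen only to show how a network might keep small latency-sensitive flows and shed large bulk transfers once capacity is cut by more than half.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """Hypothetical traffic flow; fields are illustrative assumptions."""
    name: str
    size_mbps: float         # bandwidth the flow needs
    latency_sensitive: bool  # True for small interactive traffic


def triage(flows, capacity_mbps):
    """Admit flows until capacity runs out.

    Latency-sensitive flows are considered first, then smaller flows
    before larger ones. Anything that does not fit is dropped.
    """
    ordered = sorted(flows, key=lambda f: (not f.latency_sensitive, f.size_mbps))
    admitted, dropped, used = [], [], 0.0
    for flow in ordered:
        if used + flow.size_mbps <= capacity_mbps:
            admitted.append(flow)
            used += flow.size_mbps
        else:
            dropped.append(flow)
    return admitted, dropped


# Assumed example traffic mix (not real measurements).
flows = [
    Flow("search-query", 5, True),
    Flow("mail-sync", 20, True),
    Flow("video-stream", 60, False),
    Flow("storage-backup", 80, False),
]

# Normal capacity carries everything; less than half of it does not.
for capacity in (180, 80):
    admitted, dropped = triage(flows, capacity_mbps=capacity)
    print(capacity, [f.name for f in admitted], [f.name for f in dropped])
```

With the reduced capacity in this toy example, the two small interactive flows survive while the video and storage flows are dropped, which mirrors the pattern reported above: Search was barely affected, while YouTube and Cloud Storage lost traffic.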
The issue was detected within seconds. However, the tools the engineers needed to fix the problem were, ironically, slowed down by the same congestion. As a result, restoring service took far longer than the few minutes Google targets.
Google said it was “conducting a thorough post-mortem to ensure we understand all the contributing factors to both the network capacity loss and the slow restoration”.
Lessons from the outage
The public cloud services market, valued at $182 billion, is dominated by Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). AWS alone accounts for 40% of the market, and the three together account for 65%. As the latest GCP outage has revealed, even major technology firms such as Apple and Snapchat rely on these three main cloud services providers. As a result, when one of them suffers an outage, a significant chunk of the public internet goes down.
The internet has become increasingly centralised. Because of this, despite numerous levels of redundancy, the failure of one system can disrupt services around the world. As the world moves towards a more integrated system of artefacts in the form of an Internet of Things (IoT), it becomes all the more necessary to decentralise the internet, enable local networks to carry out local tasks, and allow more players to diversify and safeguard the market.