Fastly Outage Exposes Fragility of the Internet

Internet users in the United States awoke to news that a vast swath of the world’s most prominent sites—including leading OTT providers—had gone down in the early morning hours. HBO Max, Hulu, Vimeo, Amazon, Google, Twitter, Spotify, The New York Times, Reddit and The Guardian were among major sites impacted by an outage attributed to cloud computing provider Fastly.

In a tweet, and in an email to Streaming Media, Fastly said: “We identified a service configuration that triggered disruptions across our POPs globally and have disabled that configuration. Our global network is coming back online.”

Continued status updates are available here. [See the end of this article or this Fastly blog post for a June 9 post-mortem on the outage. -Ed.]

When the outage began Tuesday, Fastly said that it was “investigating potential impact to performance with our CDN services” and warned that “customers may experience increased origin load as global services return.”

According to Mark Hendry, director of data protection and cybersecurity at legal business DWF, some of the affected organisations sought to rectify the issue by reverting to non-CDN schemes of distribution.

“However, if this is the case, users of those websites can expect for their experience to be slower than normal until the CDN can be restored,” he says. “Whilst the outage can be considered an availability of services issue, it is not clear at this time whether any underlying data or infrastructure belonging to the affected organisations has become vulnerable as a result of the issue.”
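The fallback Hendry describes, serving content directly from the origin when the CDN is unavailable, can be sketched in a few lines of Python. This is an illustration only, not a description of how any affected site actually handled the outage: the function names (`fetch_with_fallback`, `cdn_down`, `origin_ok`) and the `FetchError` exception are hypothetical, and the "endpoints" are plain functions rather than real network calls.

```python
from typing import Callable, Sequence, Tuple


class FetchError(Exception):
    """Hypothetical error raised when an endpoint cannot serve a request."""


Fetcher = Callable[[str], bytes]


def fetch_with_fallback(
    path: str, endpoints: Sequence[Tuple[str, Fetcher]]
) -> Tuple[str, bytes]:
    """Try each (name, fetcher) pair in order and return the first success."""
    errors = []
    for name, fetch in endpoints:
        try:
            return name, fetch(path)
        except FetchError as exc:
            # Remember why this endpoint failed, then try the next one.
            errors.append(f"{name}: {exc}")
    raise FetchError("all endpoints failed: " + "; ".join(errors))


def cdn_down(path: str) -> bytes:
    # Stand-in for an edge node returning errors during an outage.
    raise FetchError("503 from edge")


def origin_ok(path: str) -> bytes:
    # Stand-in for the (slower) origin server answering directly.
    return b"<html>" + path.encode() + b"</html>"


source, body = fetch_with_fallback(
    "/index.html", [("cdn", cdn_down), ("origin", origin_ok)]
)
print(source)
```

In this sketch the request succeeds, but it is answered by the origin rather than the edge, which is why users could expect slower pages until the CDN was restored.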

Others have questioned the wisdom of concentrating so much internet infrastructure in the hands of a few companies, an arrangement that leads to wide-scale disruption when things go wrong.

Adam Smith, a software testing expert with the BCS, the Chartered Institute for IT, told the BBC that outages with CDNs “highlight the growing ecosystem of complex and coupled components that are involved in delivering internet services. Because of this, outages are increasingly hitting multiple sites and services at the same time.”

Fastly runs edge compute services including Nearline Cache, a service launched last year as the first of its commercial solutions to be built in a serverless compute environment.

The company explains that Nearline Cache allows you to automatically populate and store content in third-party cloud storage near one of Fastly’s POPs “without incurring egress costs, addressing a very real challenge for long-tail content that might get evicted from cache”. With Nearline Cache, it says, “you can populate that content back into cache, resulting in overall cost savings and improved origin offload. Plus, there’s minimal latency and no new work for customers.”
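The general idea behind that description, a storage tier sitting between the edge cache and the origin, can be illustrated with a toy model. The sketch below is not Fastly's implementation: the `TieredCache` class, its attributes, and the dictionary standing in for third-party cloud storage are all assumptions made for illustration.

```python
from collections import OrderedDict


class TieredCache:
    """Toy model: a small LRU edge cache backed by a larger nearline store."""

    def __init__(self, edge_capacity, origin_fetch):
        self.edge = OrderedDict()      # small LRU cache at the POP
        self.nearline = {}             # stand-in for nearby cloud storage
        self.edge_capacity = edge_capacity
        self.origin_fetch = origin_fetch
        self.origin_requests = 0       # how often the origin was contacted

    def get(self, key):
        if key in self.edge:
            # Edge hit: mark the item as most recently used.
            self.edge.move_to_end(key)
            return self.edge[key]
        if key in self.nearline:
            # Evicted from the edge, but still held nearby: no origin trip.
            value = self.nearline[key]
        else:
            # First request for this item: go to the origin once, keep a copy.
            value = self.origin_fetch(key)
            self.origin_requests += 1
            self.nearline[key] = value
        self._store_at_edge(key, value)
        return value

    def _store_at_edge(self, key, value):
        self.edge[key] = value
        self.edge.move_to_end(key)
        if len(self.edge) > self.edge_capacity:
            # Evict the least recently used item from the edge only.
            self.edge.popitem(last=False)


cache = TieredCache(edge_capacity=1, origin_fetch=lambda k: f"content:{k}")
cache.get("a")           # origin fetch
cache.get("b")           # origin fetch; "a" is evicted from the edge
result = cache.get("a")  # repopulated from the nearline store
print(cache.origin_requests)
```

Under these assumptions, the third request is served without touching the origin, which is the "origin offload" benefit the company describes for long-tail content.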

Its operations appear to be built using technology from Swedish developer Varnish Software, whose customers include Hulu, Emirates, and Tesla.

Varnish promptly put out its own tweet denying responsibility.

Fastly’s customers include Pinterest, The New York Times, and GitHub. Its cloud partners include Google Cloud, AWS and Azure.

It made $291 million in revenue in 2020, up 45% on 2019. While its stock price initially fell on news of the outage, it was back above pre-outage pricing at the time this article was published.

Update (June 9): In an update posted to the Fastly site, Senior Vice President of Engineering and Infrastructure Nick Rockwell blames the outage on “an undiscovered software bug” that surfaced on June 8 when it was triggered by a valid customer configuration change.  

“We detected the disruption within one minute, then identified and isolated the cause, and disabled the configuration. Within 49 minutes, 95% of our network was operating as normal.” 

The root cause is traced to an earlier software deployment in May. 

“Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.” 

A post-mortem of the incident will assess why Fastly didn’t detect the bug during its software quality assurance and testing processes, Rockwell says. 

He reiterates the company’s commitment to the safety of its underlying platforms – WebAssembly and Compute@Edge.

Then the mea culpa: “Even though there were specific conditions that triggered this outage, we should have anticipated it,” he adds. “We provide mission critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority. We apologize to our customers and those who rely on them for the outage and sincerely thank the community for its support.” 

Ironically, Rockwell was CTO at The New York Times, one of the sites hit by the outage. 
