Why Amazon EC2's Titanic went down

NEW YORK (CNNMoney) — This was never supposed to happen.

Amazon Web Services is the Titanic of cloud hosting, designed with backups to the backups’ backups that prevent hosted websites and applications from failing.

Yet, like the famous ocean liner, Amazon’s cloud crashed this week, taking with it Reddit, Quora, Foursquare, Hootsuite, parts of the New York Times, ProPublica and about 70 other sites. The massive outage raised questions about the reliability of AWS and the cloud itself.

It was supposed to work like this: Thousands of companies use AWS to run their websites through a service called Elastic Compute Cloud, or EC2. Rather than hosting their sites on their own servers, these customers turn to Amazon, which essentially rents out its unused — and highly intricate — server capacity.

So what went wrong exactly?

Amazon (AMZN, Fortune 500) has been tight-lipped about the incident, and the company said it won’t be able to fully comment on the situation until it does a “post-mortem.” So it’s not clear yet exactly how the problem occurred.

But bits and pieces of information from Amazon, its customers and cloud experts help to explain what happened.

Thursday’s crash happened at Amazon’s northern Virginia data center, located in one of its East Coast availability zones. In its status log, Amazon said that a “networking event” caused a domino effect across other availability zones in that region, in which many of its storage volumes created new backups of themselves. That filled up Amazon’s available storage capacity and prevented some sites from accessing their data.

Amazon didn’t say what that “networking event” was…

Read more over on CNN.
