Amazon promises to improve redundancy after Dublin outage

From InfoWorld: AWS (Amazon Web Services) learned a lot of lessons from the outage that affected its Dublin data center and will now work to improve power redundancy, load balancing and the way it communicates when something goes wrong with its cloud, the company said in a summary of the incident.

The post mortem delved deeper into what caused the outage, which affected the availability of Amazon's EC2 (Elastic Compute Cloud), EBS (Elastic Block Store), the RDS database and Amazon's network. The service disruption began Aug. 7, at 10:41 a.m., when Amazon's utility provider suffered a transformer failure. At first, a lightning strike was blamed, but the provider now believes it actually wasn't the cause, and is continuing to investigate, according to Amazon.

Normally, when primary power is lost, the electrical load is seamlessly picked up by backup generators. PLCs (Programmable Logic Controllers) ensure that the electrical phase is synchronized between generators before their power is brought online. But in this case one of the PLCs did not complete its task, likely because of a large ground fault, which led to the failure of some of the generators as well, according to Amazon.

To prevent this from recurring, Amazon will add redundancy and more isolation for its PLCs so they are insulated from other failures, it said.

Amazon's cloud infrastructure is divided into regions and availability zones. Each region -- for example, the data center in Dublin, which is also called the EU West Region -- consists of one or more Availability Zones, which are engineered to be insulated from failures in other zones in the same region. The thinking is that customers can use multiple zones to improve reliability, something that Amazon is working on simplifying.
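
The idea of using multiple zones can be illustrated with a minimal Python sketch. It does not call any AWS API; the zone names, instance names and helper functions are hypothetical and exist only to show how spreading instances round-robin across zones leaves part of a fleet running when a single zone fails.

```python
from collections import Counter
from itertools import cycle

# Hypothetical zone names for illustration only; real identifiers differ.
ZONES = ["zone-a", "zone-b", "zone-c"]


def spread_across_zones(instance_names, zones):
    """Assign each instance to a zone in round-robin order."""
    if not zones:
        raise ValueError("at least one zone is required")
    return {name: zone for name, zone in zip(instance_names, cycle(zones))}


def surviving_instances(placement, failed_zone):
    """Return the instances that are not placed in the failed zone."""
    return [name for name, zone in placement.items() if zone != failed_zone]


# Six hypothetical web servers spread over three zones.
placement = spread_across_zones([f"web-{i}" for i in range(6)], ZONES)

# Count how many instances landed in each zone.
print(Counter(placement.values()))

# Simulate the loss of one zone and list what is still running.
print(surviving_instances(placement, "zone-a"))
```

With six instances in three zones, losing one zone takes out two instances and leaves four. Had all six been placed in a single zone, an outage in that zone would have taken down the whole fleet.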
