The Azure cloud platform lost a data centre for a number of hours recently due to inclement weather. This affected many customers — including Microsoft’s own services — for almost an entire day.
Given that the cloud is ostensibly designed to mitigate downtime by distributing workloads across multiple redundant systems, this could have been embarrassing for Microsoft, and every news outlet could have drawn attention to the failure and made fun of it.
Except they did not, and this is a good thing. People were made aware of the failure, and Microsoft communicated well on its cause and resolution progress. Ultimately, they were able to bring everything back online after about 19 hours, with zero data loss.
This incident demonstrated a number of things:
- The “new” Microsoft under Satya Nadella seems to be a lot more open about how they deal with problems.
- Data loss and downtime typically go hand-in-hand. In this case, Microsoft prioritised data retention over uptime. In their Root-Cause Analysis (RCA) after the fact, they documented valid reasons why failing over to another data centre or region would have caused data loss, and why it would have compromised the ability of related systems to function correctly.
- The media (at least what I read) was very understanding of the problem.
- The system worked exactly as intended: lightning caused the cooling systems to fail, and the data centre performed an orderly shutdown to protect the data.
I am not employed by Microsoft, nor do I have any say in how they could mitigate a similar problem in the future. Insurance companies call weather-related incidents “acts of God” for a reason: they’re unpredictable in terms of effect and scope. (Aside: the largest supercomputers in the world are designed to predict weather patterns and they’re mostly still predictions.)
In the same way that air travel has improved dramatically over the last century from people learning from their mistakes, I am pleased to see companies like Microsoft taking advantage of acts of God to design more resilient systems, with a primary goal of protecting data.
I’m not sure if there is a perfect way to avoid a similar scenario in the future, but the way Microsoft shared this knowledge will help everyone design better systems going forward.
While this was unprecedented for Azure, the amount of extreme weather caused by climate change is only going to increase. Tropical storms, tornadoes, hurricanes, cyclones, and flooding are becoming more prevalent. We also have to keep in mind that old faithfuls like fire and earthquakes aren’t going away either.
When disaster strikes (whether literally as a bolt of lightning starting a fire, or the side-effect of a water-based fire-suppression system causing an electrical fire at your disaster recovery site), how do you know?
What about something a lot simpler, like a backup failure? Just this morning (at the time of this writing) I had a customer’s SQL Server differential backup fail, but I was not notified. Fortunately, I was logged into the instance when it occurred and was able to react accordingly. The transaction log backups were still running, so there was no break in the backup chain, but I did need to make a modification on the server to deal with the problem.
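One way to avoid relying on failure notifications alone is to check independently that a recent successful backup actually exists. Here is a minimal Python sketch of that idea, not what I run for this customer. The function name, the age thresholds, and the sample timestamps are all made up for illustration; on SQL Server the real timestamps could come from the backup_finish_date column in msdb.dbo.backupset.

```python
from datetime import datetime, timedelta

# Hypothetical maximum ages per backup type; tune these to your own schedule.
MAX_AGE = {
    "full": timedelta(days=7),
    "differential": timedelta(days=1),
    "log": timedelta(minutes=30),
}


def find_stale_backups(last_success, now, max_age=MAX_AGE):
    """Return the backup types whose last success is missing or too old.

    last_success maps a backup type to the datetime of its most recent
    successful backup. A type that is absent (or None) counts as stale.
    """
    stale = []
    for backup_type, limit in max_age.items():
        finished = last_success.get(backup_type)
        if finished is None or now - finished > limit:
            stale.append(backup_type)
    return stale


if __name__ == "__main__":
    # Made-up sample data: the differential backup silently stopped two days ago.
    now = datetime(2020, 1, 2, 9, 0)
    last_success = {
        "full": now - timedelta(days=2),
        "differential": now - timedelta(days=2),
        "log": now - timedelta(minutes=10),
    }
    for backup_type in find_stale_backups(last_success, now):
        print(f"WARNING: no recent {backup_type} backup")
```

The point of checking for the presence of a recent success, rather than waiting for a failure message, is that a broken notification path and a broken backup then look the same: both show up as a missing recent backup.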
I’ll leave you with two questions:
Are your backup failure notifications working?
Are you sure?
Share your disaster recovery plans and stories below in the comments.