I am still amused by terminology in the Information Technology field. Words like “Kubernetes,” “containers,” and the BASIC keywords `PEEK` and `POKE` all bring a smile to my lips every time I read or say them.
Equally amusing are marketing ideas, except that they may carry a little sting at the end when they harden into mythology. These include beliefs such as “we don’t need to test our backups,” or that “the Cloud” is somehow magical and solves all problems.
The sad, bad news is that myths like these are untrue. Testing backups matters. So too does your infrastructure, irrespective of how or where it has been configured.
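On that first point, here is a minimal sketch of what testing a backup can look like in SQL Server (the database name and file path are hypothetical):

```sql
-- Back up with checksums so that corrupt pages are detected while the backup is written
BACKUP DATABASE [SalesDB]
TO DISK = N'\\backupserver\sql\SalesDB_Full.bak'
WITH CHECKSUM, COMPRESSION, INIT;

-- Confirm the backup file is readable and that its checksums still validate
RESTORE VERIFYONLY
FROM DISK = N'\\backupserver\sql\SalesDB_Full.bak'
WITH CHECKSUM;
```

Even this only proves the file is intact. The real test is restoring the backup somewhere, on a schedule, and running DBCC CHECKDB against the restored copy.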
At the end of the day, a bunch of companies build electronic components including motherboards, power supplies, central processing units (CPUs), random-access memory (RAM) chips, devices like hard drives and solid-state storage, network devices, and so on. Other companies assemble them, stick a label on them that says “Dell” or “IBM,” put them on ships and in aircraft, and send them to large data centres all over the world where they sit on racks in large cabinets, make a lot of noise, and require ongoing maintenance.
The nebulous “Cloud,” whether it is Microsoft’s Azure, Amazon’s AWS, or Google’s Compute Engine (and others not worth mentioning in terms of market scale, but which work the same way), features rows upon rows of these cabinets, powered and networked and cooled and managed by clever monitoring systems, and yet anything could cause them to fail. Recently, a Microsoft data centre was shut down after the cooling system was struck by lightning. A few months before that, Amazon lost a region due to a router misconfiguration. Expired SSL certificates have brought entire systems to their knees. Two years ago the Large Hadron Collider was shut down thanks to a weasel chewing through an important electrical cable.
When we design our systems for the Cloud, we have to take this sort of thing into account. I gave a talk last year at the first ever SQL Trail conference, and I made a point of stating that cloud computing requires an Internet connection. This may seem obvious as you read this in a web browser or in your RSS reader, but in some countries (including Australia), Internet connectivity and broadband are not a given.
Secondly and equally importantly, we have to design systems to handle failure. The exact implementation depends on the organization, but there must be some way to recover gracefully from a failure event, and failure events can take many forms. Earlier this year I wrote about system failure in the SQL Server book, which I am reproducing here:
Everything can fail. An outage might be caused by a failed hard drive, which could in turn be a result of excessive heat, excessive cold, excessive moisture, or a data center alarm that is so loud that its vibrational frequency damages the internal components and causes a head crash.
You should be aware of other things that can go wrong, as noted in the list below. This list is certainly not exhaustive, but it’s incredibly important to understand that relying on assumptions about hardware, software, and network stability is a fool’s errand.
- a failed network interface card
- a power surge or brownout causing a failed power supply
- a broken or damaged network cable
- a broken or damaged power cable
- moisture on the motherboard
- dust on the motherboard
- overheating caused by a failed fan
- a faulty keyboard that misinterprets keystrokes
- failure due to bit rot
- failure due to a bug in SQL Server
- failure due to poorly written code in a file system driver that causes disk corruption
- capacitors failing on the motherboard
- insects or rodents electrocuting themselves on components (this smells really bad)
- failure caused by a fire suppression system that uses water instead of gas
- misconfiguration of a network router causing an entire geographical region to be inaccessible
- failure due to an expired SSL certificate
- running a `DELETE` or `UPDATE` statement without a `WHERE` clause (human error)
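That last one is also the easiest to defend against. A minimal sketch, assuming a hypothetical dbo.Orders table and an expected row count: wrap the statement in an explicit transaction, check how many rows it touched, and only then decide whether to commit.

```sql
BEGIN TRANSACTION;

DELETE FROM dbo.Orders
WHERE OrderDate < '20150101';  -- forget this line and the whole table disappears

-- @@ROWCOUNT still reflects the DELETE at this point
IF @@ROWCOUNT > 10000  -- far more rows than we ever expected to touch
BEGIN
    ROLLBACK TRANSACTION;
END
ELSE
BEGIN
    COMMIT TRANSACTION;
END
```

This doesn’t make the mistake impossible, but it turns an unrecoverable one into a rollback.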
Many of these failures are outside of the control of a cloud-based customer, so how do we work around them, especially the ones that can cause unknown corruption?
The answer has many layers, and often includes some form of monitoring system, but what are we doing to ensure the monitoring system doesn’t fail either? Netflix has its Chaos Monkey, an open-source tool that randomly turns off servers in its production environment.
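On the specific question of unknown corruption, SQL Server itself gives us a couple of levers. A minimal sketch, again with a hypothetical database name:

```sql
-- Write a checksum on every data page, so corruption is detected the next time a page is read
ALTER DATABASE [SalesDB] SET PAGE_VERIFY CHECKSUM;

-- Read and validate every allocated page; run this regularly, and after any suspected incident
DBCC CHECKDB (N'SalesDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```

Neither of these replaces a tested backup, but together they at least turn silent corruption into detected corruption.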
Share your thoughts about failure mitigation in this new hybrid world, in the comments below.