How I tackle disaster recovery

One of my special interests as an autistic person is understanding mechanical components of a computer, both analog and digital. In the olden days, we had devices known as hard drives which used one or more spinning disks and a read/write head that would move over the surface with a buffer of air so thin it equated to a 747 flying a handful of centimetres above the earth.

I’ve had several occasions where hard drives have failed and attempts to recover data from these wonders of mechanical engineering have been mostly fruitless. I’ve experienced profound examples of data loss, in both cases losing years of email and contact details for people I met online.

This is all to say that I care deeply about data loss, and I take it personally when I’m asked to engage with potential customers to recover data in SQL Server.

This post is a high-level overview of how I tackle data recovery, whether personally or for professional consulting reasons.

Step 1: Restore from last known good backup

For real, this is the first step. If you don’t have good backups (that you’ve tested), the price immediately goes up. To be clear, I don’t bill any time if I can’t recover any data, but if I can you’d better believe I’m trying everything I can, and that knowledge comes at a price.

Step 2: Restore from a previous older backup

If the most recent backup is no good, go back to one that does work. It’s still better than nothing.

In 2010 I worked on a very expensive data recovery project that started with recovering data from the degraded half of a RAID mirror. A power failure had taken out the server, causing catastrophic data loss. After some wishful thinking and a blood sacrifice to the hard drive gods (I always seem to cut my finger when working inside a computer), we discovered that the RAID had failed two months prior to the catastrophe, and because those drives weren’t spinning at the time the power failed, the clone we made of those drives allowed recovery from two months prior.

I am not exaggerating when I say this meant that the company could stay in business. In the last year or so I had to turn down a job where the company would close down if they couldn’t get back their data. Needless to say, they closed their doors. I’m good, but I’m not a magician.

Step 3: Get a clone of the drive and work with that

Data recovery is tricky. On a drive that is in the process of failing, where simply spinning it up could damage the disk’s surface and render it useless (experience talking again), you want to make a copy as quick as you can and work with the copy. Sometimes a clone is as simple as xcopy or Robocopy on Windows, or our old friend dd on Linux or UNIX. Either way, you want to get the data off the drive in order to examine it. Cloning software can do block copy (sector by sector, bit by bit), or at the file system level.

With a file system copy you are depending on the file system itself not being corrupt. If for example a computer virus has infected the host, a file system copy in a different machine is just fine. If it’s mechanical failure or corruption, you’ll want to do a block copy instead.

Step 4: Examine the bytes

This step ~~can be~~ is extremely time-consuming. A number of factors come into play, and recently (in the last decade or so) as encryption has become more commonplace, this might be the end of the line. For example, macOS has a feature called File Vault which does whole-drive encryption. Not even Apple keeps a copy of the private key, so if you’re locked out, you’re locked out.

However, if you do a clone of the drive and then mount that clone in a compatible machine so that the file system can be mounted, you might be able to decrypt it. Maybe. It depends. My rule of thumb is that any encryption that uses 128-bit keys or higher is out of my pay grade.

For unencrypted data, this is where the process gets interesting. I once had a drive I had partitioned, and somehow the partition table got corrupted. I phoned a friend who knew someone, who showed up at my gate with a box of diskettes. After booting into a command line environment, they were able to rebuild the partition table by examining the file allocation tables and work backwards to the partition boundary, looking for magic numbers on the disk surface.

Let’s talk about magic numbers quickly. This is a programming term that is usually derisive, but it means that certain software developers including Microsoft themselves write hard-coded values at the beginning of files (or boot sectors, or partition tables) that identify that data as being a certain format. Every single PDF document for example, starts with the same binary prefix. The old binary format of Office documents had their own unique magic number as well, before they changed to a regular zip file format in 2007. In fact, all zip files also have a magic number at the beginning. It’s the “secret” behind macOS being able to recognize so many file formats without having a known file extension. Microsoft’s philosophy for Windows is very different, where file types are explicitly spelled out in the file extension. But if you’re examining the raw bytes of a disk sector, you won’t necessarily know what file format you’re looking at unless you recognize the magic number.

It just so happens that I used to have encyclopedic knowledge of these magic numbers. Decades ago, when I was spending my time researching my special interest, I memorized these values and they just stuck.

Of course, finding the start of the file is only half the battle. Or a small percentage anyway. Windows is especially bad (or good, depending on your point of view) at fragmentation. In other words, depending on the block size you formatted your drive at (with 512-byte and 4-KB being the most common), you might have to go spelunking for all the parts of a file. With a corrupt file system, this might be really tricky and time consuming. Even worse: there’s no easy way to know if you’re looking at deleted data. When you delete a file, it just marks that section of the drive as free for use again. On more modern storage devices, this could entail zeroing out the cells before data is written again, but on spinning drives that block would stay there until the operating system decides to use that space again.

Step 5: Isolate the data

For SQL Server corruption, most times we already have the data files — or the backup — and we’re looking at logical corruption. If physical damage has occurred, we at least know in which data page(s) the corruption occurred. And, because SQL Server aligns its data files on 8-KB boundaries, we can isolate the corruption to exact locations on the drive surface. In other words, it turns out that data recovery in SQL Server is significantly easier than reading data off a cloned drive where the file system is corrupt.

Step 6: Recover the data … maybe

Assuming we’re in SQL Server and we’ve exhausted all other methods to reconstruct the data page (from non-clustered indexes, transaction logs, even Change Data Capture tables), we can look at the row(s) in question. Depending on where the corruption is, knowing what the structure of a data page is — not to mention the metadata around columns and data types — it is possible to have a full recovery of corrupt data.

Of course, that’s a lot of “it depends.” If the corruption is in system tables that describe the metadata, you’re out of luck. If it’s the boot page of the database, you’re out of luck. If the underlying RAID configuration is missing more than the number of drives required to maintain data consistency, you’re out of luck. I can examine bytes until the heat death of the universe, and it won’t make a difference if I can’t decipher parity bits.

Step 7: Pray

I have had some interesting engagements where I was able to perform full recovery, and others where companies went out of business. Others still were able to find old backups where corrupt segments of their database were overwritten with a hex editor, and it worked just fine.

This is a 1,500-word blog post to implore you to make sure you back up your data and test those backups within your recovery time objective (RTO). Data recovery is hit and miss. It’s expensive. It’s time consuming. It’s also very satisfying, but honestly not as satisfying as being able to restore your database using an automated process or getting the old data out of a temporal history table.

Share your disaster recovery stories and tips in the comments.

Photo by Michael Longmire on Unsplash.