The perils of RAID

Recently, I was asked to assist an organization with getting their data back for a SQL Server that had experienced physical hard drive failure.

I asked a lot of questions about the configuration of the data layer, and it came to light that it was a fairly regular setup. SQL Server is virtualised on VMWare, with a VMDK file that contains the data drive.

This file was stored on a RAID 50 array (also known as RAID 5+0), where two drives had failed at roughly the same time. Depending on which drives in the respective logical groups had failed, there was a small chance the data was recoverable, but I was skeptical.

The representative had managed to somehow mount the drives so that he could recover the VMDK file, mount it as a drive, and pull files from it to try and recover the SQL Server database.

Coincidentally, I had recently written about recovering from catastrophic data loss, so the memory was still fresh in my mind about how the RAID files that had been copied from the disks were corrupt.

One of the most talented data recovery people I know was already working on trying to recover the database, and his description of what he’d attempted already was pointing to an unrecoverable RAID failure.

We took a look through some random files on the disk. I was looking for evidence that the VMDK file that had been copied was taken from a RAID array in a corrupt state, namely that one of the disks had failed, and that the second disk had failed during the rebuild of the array.

The easiest way to see this is to look for a JPEG or PNG file over 256 KB in size. Most RAID block sizes are multiples of 64 KB, usually 128 KB or 256 KB. Each block is split over the individual physical disks, with a parity bit, so for a particular block of data, if the RAID array has failed, you will see a corrupt image, or the image won’t be viewable at all.

There were no images that were large enough, so we looked for the next best thing: SQL Server’s ERRORLOG files. By default, there are up to six files, and each of them will usually contain more than 256 KB of data. A disaster at the RAID controller level would present as garbage data at a 64 KB offset somewhere in the file.

Sure enough, after scrolling through the first file we opened, there was a large chunk of random data in the file, running for 32 KB. I didn’t need to see any more to know that this array was not recoverable.

In this case, two disks went away in one of the RAID 5 sets at the same time. Two is effectively a catastrophic failure because there isn’t enough information, even with the parity bit distributed across drives, to rebuild one logical side of the RAID 50 stripe.

The representative was advised to send the drives away for data recovery. While I don’t expect them to have any luck in recovering the data, at least we were able to demonstrate the problem visually. If a text file looked like that, there was no chance we could be expected to safely recover a SQL Server database.

The moral of the story is to always back up your database, and make sure that it is securely copied off-site as soon as it is completed. Make sure you test that backup, and make sure you run a DBCC CHECKDB on the restored database. Use a backup checksum as well, since not even CHECKDB can catch all corruption.

Share your corruption story with me on Twitter, at @bornsql.