On the importance of good backups

I’ve been vocal for some years about the importance of backups, and I have zero sympathy for anyone who does not have good backups*.

This comes from (very) painful experience: a few years ago, while migrating data from several drives to what was then my new computer, I lost the drive everything had been copied to before I could make a backup. It was not recoverable:

Unfortunately we were not able to recover any data from the Seagate hard drive you sent us. The drive arrived to us with what looked like a bad voice coil. We were able to verify the circuit board and voice coil were operational. We opened the hard drive in a clean room and discovered massive damage to all platters and heads from a skipping head. We can see millions of pot marks on the platters from a skipping head. The internal white filter is black with platter debris. The desiccant packet inside the drive shows that it absorbed a lot of water causing black blotches on the outside. This drive was turned on after the initial failure for a very long time to create the damage. We typically see this type of damage when someone puts the drive in and out of a freezer several times. This would not be a recoverable drive.

(Thanks to Robert Fovall from lowcostrecovery.com for taking a look. If you’re interested, they have the best rates and service out there.)

Tonight, to participate in the latest survey from Paul Randal (of SQLskills.com), I ran a query on one of my client’s servers. While I was signed in, I decided to do a checkup on disk space and backups, because, you know, backups.

I noticed that the data drive on the server was using over 500GB for the backup folder (which is copied every 15 minutes to a separate machine, with a removable encrypted drive, swapped out every week, etc., etc.)

The problem is that the backup drive only has a 500GB capacity, so the backup folder no longer fits on it. Strangely, when I checked this drive last week, it hadn’t run out of space. So someone missed it on their daily checklist (or Run Book). I’ll have to figure that out in the morning.

More importantly, the SQL Server backups were no longer being copied off the SQL Server machine every fifteen minutes as planned. This is as bad as not having a backup at all. Should the RAID fail, I’d lose everything newer than the latest copied backup, which, for all I know, could be a week old.
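A check like this can be automated rather than left to a daily checklist. What follows is a minimal sketch in Python, not the script used on this server: the function names, the 15-minute threshold as a default, and the idea of comparing folder size against a known drive capacity are all assumptions for illustration.

```python
import time
from pathlib import Path


def backup_folder_status(folder, max_age_minutes=15, now=None):
    """Summarize a backup folder: total size, age of newest file, staleness.

    Hypothetical helper: `folder` is wherever the backups are copied to.
    """
    now = time.time() if now is None else now
    files = [p for p in Path(folder).rglob("*") if p.is_file()]
    if not files:
        # An empty destination is treated as stale: nothing has been copied.
        return {"total_bytes": 0, "newest_age_minutes": None, "stale": True}
    total_bytes = sum(p.stat().st_size for p in files)
    newest_mtime = max(p.stat().st_mtime for p in files)
    age_minutes = (now - newest_mtime) / 60
    return {
        "total_bytes": total_bytes,
        "newest_age_minutes": age_minutes,
        "stale": age_minutes > max_age_minutes,
    }


def fits_on_drive(total_bytes, drive_capacity_bytes):
    """True if the backup set is no larger than the destination drive."""
    return total_bytes <= drive_capacity_bytes
```

Pointed at the copy destination rather than the source folder, a stale result would have flagged the stalled copy within minutes instead of days. Comparing the source folder's size against the destination drive's capacity would have warned about the space problem before the copy failed.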

So I decided to delete some of the old backup files. Normally I recommend not taking this course of action unless absolutely necessary. Our plan is to retain three months of backups for this server, and there just isn’t enough space on the backup drive. So we’ll have to modify the plan and update the Run Book to take this into account.

But here’s the fun part: I decided to use my existing recovery scripts to do a test-restore of the largest database, and guess what? The scripts were wrong. It was a small thing, easily fixed by me, but for someone who doesn’t know SQL Server and can’t read error messages? World-ending (or in this case, medical clinic-closing).
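For anyone who has never written a test-restore script, here is a rough sketch of what one might generate. It is not the script from this server. The Python helper below only builds the T-SQL text; the database name, logical file names and paths are hypothetical, and the real logical file names would need to be read from the backup first (for example with RESTORE FILELISTONLY).

```python
def quote_name(name):
    """Bracket-quote a SQL Server identifier, escaping any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_text(value):
    """Quote a value as a T-SQL Unicode string literal."""
    return "N'" + value.replace("'", "''") + "'"


def build_test_restore(backup_path, test_db, file_moves):
    """Return T-SQL that restores a backup under a scratch name and checks it.

    file_moves maps each logical file name to the physical path it should be
    restored to, so the test copy does not land on the live database files.
    """
    options = [
        f"MOVE {quote_text(logical)} TO {quote_text(physical)}"
        for logical, physical in file_moves.items()
    ]
    # REPLACE overwrites a scratch database left over from an earlier test.
    options += ["REPLACE", "STATS = 10"]
    return "\n".join([
        f"RESTORE DATABASE {quote_name(test_db)}",
        f"FROM DISK = {quote_text(backup_path)}",
        "WITH " + ",\n     ".join(options) + ";",
        # A restore that completes can still contain corruption, so check it.
        f"DBCC CHECKDB ({quote_name(test_db)}) WITH NO_INFOMSGS;",
    ])


if __name__ == "__main__":
    # All names and paths here are made up for illustration.
    print(build_test_restore(
        r"D:\Backups\Clinic.bak",
        "Clinic_RestoreTest",
        {
            "Clinic": r"E:\RestoreTest\Clinic.mdf",
            "Clinic_log": r"E:\RestoreTest\Clinic_log.ldf",
        },
    ))
```

Restoring under a separate name keeps the test away from the production database, and running the generated script on a schedule is what catches a broken script before an emergency does.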

I was able to attend to a disk space shortage and test my restore scripts, and I was found wanting. So this is a reminder that even your perfectly-laid plans must be checked and re-checked periodically.

* A good backup is one that can be restored. I know it sounds obvious, but this is why you test them, right? Right?

Photo by Nathan Dumlao on Unsplash.