This is how I recovered from a catastrophic failure

I was fresh off the boat* from South Africa, working in a small computer store in an equally small town in Saskatchewan. I worked five days a week in winter and six in summer, mainly removing malware from laptops and selling new computers (which is why I’m happy to recommend Asus and MSI to this day).

One of the side effects of having an MCSE in my back pocket back then was that I was occasionally pulled in to help the boss, Bill, with bigger jobs.

A notice went out to the entire city from the local electricity provider that they would be performing emergency maintenance on one section of the grid, and that all sensitive electronic devices should be shut down and unplugged from the wall, just to be safe.

A large client of ours, who had only recently signed with us, heeded this warning. They had unplugged every single desktop and laptop in the building. They were conscientious to a fault. The fault here being that they forgot about a Windows Small Business Server in a custom closet that had been specially built.

This server did not have any backups. I know this because the backup hardware was on backorder for them.

We discovered this after a panicked phone call to our shop the morning after the power notice had gone out. No one could connect to their email (don’t you love how it’s always email?).

Bill and I got in his truck and off we drove, speculating that they had forgotten about the server. We arrived, and I beelined to the custom closet. The client’s most senior board member was there, as was the head of HR, who also happened to be head of IT. Small town Saskatchewan works that way.

After taking photographs of the server in situ, we unplugged and removed it from the closet and I removed the side cover.

Many of you reading this know the smell I’m about to describe. Ozone has a very distinctive odour.

I used a flashlight to look into the power supply unit, because that’s where the smell was coming from. It wasn’t pretty, but a burnt out PSU isn’t meant to be pretty.

I showed Bill what I saw, bearing in mind two high level folks from the client were watching our every move. I made sure to document it all while I went, mainly with photographs.

The inevitable question came from the big boss: “So how bad is it?”

This was seven years ago now, so I don’t remember my exact wording, but I looked him in the eye and I said, “There are two words I never like to say together in the same sentence, when it comes to computers. Those words are ‘catastrophic failure’. We will have to take this back to the shop and pull it apart to figure out what happened, but I want you to understand that there is a significant chance that I will be unable to recover any data.”

In that moment I understood a little bit of how emergency room doctors feel when they give bad news. The client knew they would go out of business because of this data loss.

We reattached the side of the server, stuck it in Bill’s truck and went back to the office.

I cleared some room on my desk and set to work, documenting the removal of hardware. There were four drives in the server. Based on the smell alone, I believed that all four drives were irrecoverable, but I had to prove it. Insurance would cover hardware replacement for the client, but we don’t make assumptions about the lifeblood, which is their data.

I want to stop here and talk about dust. Saskatchewan is dusty because it’s flat and there’s a lot of wind and open space. How windy? I have had metal garden furniture flying off my deck.

The server interior was so dusty that the canned air I used on it wasn’t enough. I had to use a static-free cloth to wipe down the components I removed.

The first order of business was to clone the hard drives so that I could work off the clones to attempt data recovery. There was no way of knowing if they’d spin up, and I had no idea how they’d been configured. The server was not even starting up.
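For readers who have not done this before, here is a minimal sketch of the idea behind cloning a drive before touching it: read the source block by block, zero-fill anything that cannot be read, and record a checksum of the clone so you can later prove it has not changed. This is an illustration only, not the tool used in this story. The function name `clone_image`, the block size, and the zero-fill policy are all assumptions made for the example.

```python
import hashlib
import os


def clone_image(src_path, dst_path, block_size=1024 * 1024):
    """Copy src to dst block by block (hypothetical helper for illustration).

    Blocks that cannot be read in full are padded with zeros and their
    offsets are returned, so the clone keeps the same layout as the source.
    """
    digest = hashlib.sha256()
    bad_offsets = []
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        size = src.seek(0, os.SEEK_END)  # total bytes to copy
        offset = 0
        while offset < size:
            want = min(block_size, size - offset)
            try:
                src.seek(offset)
                block = src.read(want)
            except OSError:
                block = b""  # unreadable region, handled below
            if len(block) < want:
                # Assumption: keep going and zero-fill rather than abort.
                bad_offsets.append(offset)
                block += b"\x00" * (want - len(block))
            dst.write(block)
            digest.update(block)
            offset += want
    # The digest describes the clone as written, including any zero-fill.
    return digest.hexdigest(), bad_offsets
```

On a genuinely failing drive, a purpose-built imaging tool such as GNU ddrescue is a better choice, since it retries and maps bad regions far more carefully than a simple loop like this.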

Once the drives were cloned, I replaced the power supply unit and plugged the cloned drives into the motherboard, to see if I could at least establish the RAID configuration.

There was a warning that the RAID mirror was broken, which initially I thought was because I’d pulled the original drives, but the date didn’t match. The error was showing two months prior, long before we had been asked to look after this new client.

After acknowledging the error, I looked at the RAID configuration. Two separate RAID 1 mirrors, for a C: and D: drive. Well, that was some good news at least. If the drives had been striped, my recovery options would have been close to zero. At least mirrored drives might have some opportunity to recover.
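If the difference between mirroring and striping is unfamiliar, this toy sketch shows why it matters for recovery. It models drives as byte strings, which is a big simplification of a real RAID controller, and the function names and the four-byte stripe size are made up for the example.

```python
def mirror_write(data, n_drives=2):
    """RAID 1 style: every drive holds a complete copy of the data."""
    return [bytes(data) for _ in range(n_drives)]


def stripe_write(data, n_drives=2, stripe_size=4):
    """RAID 0 style: chunks of the data alternate across the drives."""
    drives = [bytearray() for _ in range(n_drives)]
    for i in range(0, len(data), stripe_size):
        drives[(i // stripe_size) % n_drives] += data[i:i + stripe_size]
    return [bytes(d) for d in drives]


data = b"ABCDEFGHIJKLMNOP"
mirror = mirror_write(data)
stripe = stripe_write(data)

# Pretend drive 0 died and look at what the surviving drive holds.
print(mirror[1])  # b'ABCDEFGHIJKLMNOP' (everything is still there)
print(stripe[1])  # b'EFGHMNOP' (only every other chunk survives)
```

With a mirror, any single surviving drive is a full copy. With a stripe, a single surviving drive holds only fragments, and the gaps cannot be rebuilt from it.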

By now I had the cloned drives plugged in and configured in the same way as the originals, just updating the devices in the BIOS.

The server didn’t boot. No surprise there. I tried a few more combinations and nothing worked.

Then, I decided to try one of each drive, with no mirrors in place, and see what that did. After the third combination, the server started up.

Windows Small Business Server 2007. It said Safe Mode on it, but it was a running operating system.

And a login screen.

I logged in, expecting the worst. That’s exactly what I got. Files were missing or corrupt. SharePoint wasn’t working. Exchange Server wasn’t working. (Safe Mode, remember?) It was a mess.

Then I did what anyone would do in this situation (well, I’d do it again if I was in this situation). I turned it off, removed the cloned drives, and stuck the original drives back in, just with the new combination I’d discovered with the cloned drives.

It booted again. The drives spun. The Windows Safe Mode screen was still there, but it booted. I logged in, and stuff was there. It was there, but it was two months old.

Yup, I had managed to somehow get the degraded side of the RAID mirror working. Given the risks, I began a disk2vhd across the network, cloning the drive that way.

Once I had the data in a VHD, I did a full hardware and software check of the drives. The degraded mirror drives, the ones that failed two months prior, were fine. We had, effectively, a backup from two months prior.

There is a lot more to say about rebuilding an entire network because the Active Directory portion of the data was irrecoverable. And about how I extracted PSTs from the corrupt Exchange Server database. But that’s for a much later post. I still have flashbacks.

For the record, the new server and two weeks of my time to rebuild everything for the new client were not cheap. Amusingly, the backup hardware we had on backorder arrived during that first week.

I’m not a hero. I was lucky. The client treated me like a hero, but I was lucky. There was no good reason for this to have worked.

In my postmortem, I figured that the power supply unit had blown up, and taken the “good” side of each RAID mirror with it. I believe that the degraded drives were just lucky to not get nuked by the power spike (maybe the other drives took the brunt of it). There was still some obvious corruption though, which is probably what caused the mirror to degrade in the first place. Active Directory was gone, some email was gone (not to mention two months’ worth of new data).

Test your backups. Please.
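One simple way to test a backup is to restore it somewhere else and compare it, file by file, against the source. The sketch below does that with SHA-256 checksums. It is a minimal illustration under stated assumptions: the function names are invented for the example, and it assumes both the source and the restored copy are ordinary directory trees you can read.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_restore(source_root, restored_root):
    """Return (missing, changed) relative paths in the restored copy."""
    src = tree_digests(source_root)
    dst = tree_digests(restored_root)
    missing = sorted(set(src) - set(dst))
    changed = sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k])
    return missing, changed
```

If both lists come back empty, the restore matches the source at the moment you compared them. It says nothing about application-level health (a mail database can be byte-identical and still corrupt), so a real test should also try opening the restored data.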

* It was a plane, and I had already been in town for about three months.