What is a good Disaster Recovery Plan?

During a Q&A session I hosted at our local Calgary SQL Server User Group last month, one of the attendees mentioned an interview question he’d been asked, and wanted to hear my thoughts on it.

How would you design a disaster recovery solution for a 1 TB database, using Simple Recovery Model? Replication is not permitted, and switching to the Full Recovery Model is not permitted.

Reminder: Simple Recovery Model means no transaction log backups. Only full and differential backups are possible.
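
As a quick sanity check before planning anything, the recovery model is easy to confirm (the database name here is just a placeholder):

    -- Confirm the recovery model of the database in question
    SELECT name, recovery_model_desc
    FROM sys.databases
    WHERE name = N'SalesDB';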

My first question to the attendee was what the Service Level Agreement (SLA) says. As we know from previous posts, a disaster recovery strategy is dictated by business requirements, not technical ones. The Recovery Point Objective (how much data loss is acceptable) and Recovery Time Objective (how much time there is to bring everything back) will guide my proposal.

He told me that the SLA was 24 hours, so I started writing on the whiteboard while thinking aloud.

On average, a fast storage layer can read and write around 200 MB/s, so it would take 5.12 seconds to write 1 GB, or roughly 87 minutes to restore the 1 TB database back to disk, not counting the log file or crash recovery. I never assume that Instant File Initialization is enabled, and I won’t know how big the transaction log file is going to be; the log file has to be zeroed out regardless.

The first rule of designing a disaster recovery plan is to expect the worst. Imagine the building has burned down and there’s no Internet access. That means no new server, no operating system, no software installation media. It’s the worst it could be.

Since 87 minutes goes into 24 hours quite comfortably, I then started asking about network speed. Imagine that the backup file has to be copied over an Ethernet connection at (best case) gigabit speeds, which have a theoretical maximum of 125 MB/s. Given that word “theoretical”, I always take away 20%, which leaves 100 MB/s. So assuming the backups are stored on a network device that can deliver our files at gigabit speeds, we’re looking at almost three hours just to copy the backup file across the network, at minimum.
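
Here is a rough sketch of that arithmetic, using the 200 MB/s storage figure and the 100 MB/s effective network figure from above:

    -- Back-of-the-envelope estimates for a 1 TB database
    DECLARE @db_mb        AS bigint = 1024 * 1024; -- 1 TB expressed in MB
    DECLARE @disk_mb_s    AS int    = 200;         -- local storage read/write speed
    DECLARE @network_mb_s AS int    = 100;         -- gigabit Ethernet minus ~20% overhead

    SELECT
        @db_mb / @disk_mb_s / 60.0      AS restore_write_minutes, -- roughly 87 minutes
        @db_mb / @network_mb_s / 3600.0 AS network_copy_hours;    -- almost 3 hours

Neither estimate includes zeroing out the log file, crash recovery, or the time it takes to find a server to restore onto.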

Here we get into assumptions. We have to assume that new hardware can be provided within that 24 hour window. If I don’t have a server to install SQL Server onto, then I won’t be able to meet the SLA. If I don’t have Windows Server (or Linux!) to install onto the server, I won’t meet the SLA. If I cannot install the latest updates to Windows and SQL Server, I won’t meet the SLA.

Now we get into a more challenging aspect: how to take backups so that there’s the minimum amount of downtime and the most efficient use of disk space for the backups.

With the Simple Recovery Model, we have our full database backups, which capture the entire database plus as much of the transaction log as is required to make the database transactionally consistent when restored. In other words, any in-flight transactions that commit before the backup finishes are rolled forward during the restore, and any that have not committed by then are rolled back. SQL Server handles this for us during the backup itself, and again during the recovery phase of the restore, but a long-running transaction that has to roll back can delay bringing the database online, and I’ve seen cases where this takes several hours.
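
As a minimal sketch, a full backup might look like this (the database name and UNC path are illustrative):

    -- Full backup, with a checksum so corruption is caught at backup time
    BACKUP DATABASE SalesDB
    TO DISK = N'\\backupserver\sql\SalesDB_FULL.bak'
    WITH CHECKSUM, STATS = 10;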

The other thing we can do with the Simple Recovery Model is take differential backups. These keep track of all the extents in the database that have changed since the last full backup. In many cases a differential backup will be significantly smaller than the full backup, so it becomes easier to keep a smaller set of files and still have a reasonably up-to-date backup.

Remember that a differential backup is not an incremental backup. If you take a differential backup every night, each one will be bigger than the previous night’s, because it contains every change since the last full backup, not just the changes since the last differential.
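
A differential backup uses the same syntax with one extra option (again, names and paths are illustrative):

    -- Differential backup: every extent changed since the last full backup
    BACKUP DATABASE SalesDB
    TO DISK = N'\\backupserver\sql\SalesDB_DIFF.bak'
    WITH DIFFERENTIAL, CHECKSUM, STATS = 10;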

Of course I’d recommend using backup compression, which has been available in Standard Edition since SQL Server 2008 R2, and in Enterprise Edition since SQL Server 2008. Compressed backups take less time to back up and restore, at the cost of higher CPU utilisation. I think this trade-off is acceptable.
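
Compression is one more option on the backup itself, and the msdb history tables show how much space it saved. A sketch, using the same illustrative names as above:

    -- Compressed full backup
    BACKUP DATABASE SalesDB
    TO DISK = N'\\backupserver\sql\SalesDB_FULL.bak'
    WITH COMPRESSION, CHECKSUM, STATS = 10;

    -- Compare raw and compressed sizes of recent backups
    SELECT TOP (10)
        database_name,
        type,                                 -- D = full, I = differential
        backup_finish_date,
        backup_size / 1048576.0            AS backup_mb,
        compressed_backup_size / 1048576.0 AS compressed_mb
    FROM msdb.dbo.backupset
    ORDER BY backup_finish_date DESC;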

I’d also seriously look at having the backups go off-site to Azure Blob Storage or Amazon S3. If you use Amazon Glacier, keep in mind that a standard retrieval takes several hours (typically three to five) before the first file comes back.
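
On SQL Server 2016 and later, backups can go straight to Azure Blob Storage using a Shared Access Signature credential. A minimal sketch, where the storage account, container and SAS token are all placeholders:

    -- One credential per container, named after the container URL
    CREATE CREDENTIAL [https://mystorageaccount.blob.core.windows.net/sqlbackups]
    WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
         SECRET   = '<SAS token goes here>';

    -- Back up directly to the container
    BACKUP DATABASE SalesDB
    TO URL = N'https://mystorageaccount.blob.core.windows.net/sqlbackups/SalesDB_FULL.bak'
    WITH COMPRESSION, CHECKSUM, STATS = 10;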

So, to summarise, if the building has burned down:

  • Do we have electricity?
  • Do we have a network?
  • Do we have a Storage Area Network (SAN)?
  • Do we have a server?
  • Do we have software?
  • Do we have Internet access?
  • Do we have a local copy of the latest backups?
  • Can we get a local copy of the latest backups on site within that 24-hour window?
  • Can we restore the backups and verify them (DBCC CHECKDB) within that 24-hour window? (A restore sketch follows this list.)
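
For that last point, a minimal restore sequence might look like the following, assuming one full and one differential backup, and illustrative logical file names, paths and drive letters:

    -- Restore the full backup, leaving the database ready for further restores
    RESTORE DATABASE SalesDB
    FROM DISK = N'\\backupserver\sql\SalesDB_FULL.bak'
    WITH MOVE N'SalesDB'     TO N'D:\Data\SalesDB.mdf',
         MOVE N'SalesDB_log' TO N'L:\Log\SalesDB_log.ldf',
         NORECOVERY, STATS = 10;

    -- Apply the most recent differential and bring the database online
    RESTORE DATABASE SalesDB
    FROM DISK = N'\\backupserver\sql\SalesDB_DIFF.bak'
    WITH RECOVERY, STATS = 10;

    -- The job isn't done until the consistency check comes back clean
    DBCC CHECKDB (SalesDB) WITH NO_INFOMSGS, ALL_ERRORMSGS;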

Sometimes, if it’s a real disaster, the best thing might be to spin up a virtual machine in Azure, Google Compute Engine, or Amazon Web Services, and restore the backup there. If off-site backups are going to the cloud already, this might be the optimal disaster recovery plan in 2017. A new VM in Azure takes under half an hour to provision. Taking network speed into account, it might take three to six hours to restore the latest full and differential backup files, and running DBCC CHECKDB to verify the database adds another two hours or so.

The attendee indicated that he would use SAN replication to bring the database back from the dead. My counter to that was that if it was a true disaster, there wouldn’t be a SAN at all.

If you have any thoughts about how you would tackle this scenario, find me on Twitter at @bornsql.

On Best Practices

We hear a lot about “best practice”, but what does it mean in the context of SQL Server? Wikipedia describes best practice as follows:

A best practice is a method or technique that has been generally accepted as superior to any alternatives because it produces results that are superior to those achieved by other means or because it has become a standard way of doing things, e.g., a standard way of complying with legal or ethical requirements.

We will take a high-level look at high availability and disaster recovery, an important part of database management and administration and a good example of how a certain way of doing things has become a standard. Then we will discuss what it means to implement a best practice for disaster recovery.

High Availability

There are many ways to achieve high availability (HA) — the idea that our database has zero (or close to zero) downtime — and usually this is dictated by our budget. We want business continuity in the face of planned and unplanned downtime. This may take the form of database mirroring, Availability Groups, clustering, or log shipping.

Disaster Recovery

Disaster recovery (DR) is slightly different. It assumes that a disaster has occurred, our highly-available environment has gone down, and now the clock is ticking. Company losses are mounting by the minute.

How this affects me, the DBA

HA/DR is usually expressed in a Service Level Agreement using two terms:

  • RPO, or recovery point objective: the maximum amount of data loss allowed by the business.
  • RTO, or recovery time objective: the maximum amount of time it should take to recover from a disaster.

What we want is a reliable, predictable and repeatable disaster recovery plan, even if we have a high availability solution that has never failed before. If our Availability Group doesn’t fail over cleanly, or the drive array is on fire, or that Windows Update managed to kill the network interface, we need to roll up our sleeves and activate the DR plan.

So, according to guidance, that means having regular full backups, differential backups, and transaction log backups ready to restore at a moment’s notice, right? Yes, but that’s not where our job stops.
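
One small piece of that readiness is knowing, at any moment, when each database was last backed up. A sketch against the msdb history tables (filters and thresholds would vary by environment):

    -- When was each user database last backed up, of any type?
    SELECT
        d.name,
        MAX(b.backup_finish_date) AS last_backup_finish
    FROM sys.databases AS d
    LEFT JOIN msdb.dbo.backupset AS b
        ON b.database_name = d.name
    WHERE d.database_id > 4  -- exclude master, tempdb, model and msdb
    GROUP BY d.name
    ORDER BY last_backup_finish;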

Putting our best foot forward

This is where best practice comes in. It’s all very well having a good disaster recovery plan, but it’s useless if we don’t test it. Best practice in this case would be having both a good backup strategy and a methodology to make sure our DR plan is tested often, so that when something goes wrong, we have a script to follow. That way we will be able to recover with a small RPO and a short RTO, so that business is impacted as little as possible.
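
Part of that regular testing can be automated. As a sketch (the path is illustrative), a scheduled job could at least confirm that the latest backup file is readable, although this is never a substitute for an actual test restore followed by DBCC CHECKDB:

    -- Confirm the backup is complete and readable, and validate its checksums
    RESTORE VERIFYONLY
    FROM DISK = N'\\backupserver\sql\SalesDB_FULL.bak'
    WITH CHECKSUM;

    -- List what the backup file actually contains
    RESTORE HEADERONLY
    FROM DISK = N'\\backupserver\sql\SalesDB_FULL.bak';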

We call this a Run Book. Anyone in our organisation should be able to follow the script and perform a successful recovery.

What if we are incapacitated in some way, or, as happened to me recently, stuck on an aircraft with no access to the Internet? What if the only person who is able to restore the environment is the CEO, who has just heard that her building is on fire?

Not only must our backups be stored securely somewhere (in my post last week, I provided a budget-friendly way to do off-site backups), but we must also be able to restore to a known point in time, within the recovery time objective, losing as little data as possible.

Our run book should be clear enough that the person testing the disaster recovery plan can follow it without needing to ask us questions.

Earlier we saw the word “superior” repeated in the definition of best practice. A superior solution doesn’t need to be the most expensive one, using the latest and greatest tools. A run book could be an Excel file containing enough information to rebuild a server and locate magical scripts in the cloud that are tested weekly.

Know the problem we’re trying to solve. Build a solution that is practical and cost-effective for our environment. Test and iterate until it is flawless. That’s best practice.