During a Q&A session I hosted at our local Calgary SQL Server User Group last month, one of the attendees mentioned an interview question he’d had and wanted to hear my thoughts.
How would you design a disaster recovery solution for a 1 TB database, using Simple Recovery Model? Replication is not permitted, and switching to the Full Recovery Model is not permitted.
Reminder: Simple Recovery Model means no transaction log backups. Only full and differential backups are possible.
My first question to the attendee was what the Service Level Agreement (SLA) says. As we know from previous posts, a disaster recovery strategy is dictated by business requirements, not technical ones. The Recovery Point Objective (how much data loss is acceptable) and Recovery Time Objective (how much time there is to bring everything back) will guide my proposal.
He told me that the SLA was 24 hours, so I started writing on the white board while I was thinking aloud.
On average, a fast storage layer can read and write around 200 MB/s, so it would take 5.12 seconds to write 1 GB, or just under 85 minutes to restore the database back to disk, not counting the log file or crash recovery. I never assume that Instant File Initialization is enabled, plus I won’t know how big the transaction log file is going to be, and that needs to be zeroed out.
The first rule of designing a disaster recovery plan is to expect the worst. Imagine the building has burned down and there’s no Internet access. That means no new server, no operating system, no software installation media. It’s the worst it could be.
Since 85 minutes goes into 24 hours quite comfortably, I then started asking about network speed. Imagine that the backup file has to be copied over an Ethernet connection at (best case) gigabit speeds, which has a theoretical maximum of 125 MB/s. Given that “theoretical” word in there, I always take away 20% and I’m left with 100 MB/s. So assuming the backups are stored on a network device that can deliver our files at gigabit speeds, we’re looking at almost three hours to write the database files, at minimum.
Here we get into assumptions. We have to assume that new hardware can be provided within that 24 hour window. If I don’t have a server to install SQL Server onto, then I won’t be able to meet the SLA. If I don’t have Windows Server (or Linux!) to install onto the server, I won’t meet the SLA. If I cannot install the latest updates to Windows and SQL Server, I won’t meet the SLA.
Now we get into a more challenging aspect: how to take backups so that there’s the minimum amount of downtime and the maximum use of disk space for the backups.
With the Simple Recovery Model, we have our full database backups, which take the entire database and as much of the transaction log as is required to make the database transactionally consistent when restored. In other words, if there are any in-flight transactions that are committed by the time the backup finishes, those should roll forward during restore, and any that don’t commit by then are rolled back. SQL Server handles this for us during the backup itself, and again during the crash recovery phase, but a long-running transaction that has to roll back can delay startup, and I’ve seen cases where this can take several hours.
The other thing we can do with Simple Recovery Model, is differential backups. These are backups that keep track of all the extents in the database that were changed since the last full backup. In many cases, these differential backups will be significantly smaller than the full backup, so it becomes easier to keep a smaller set of files but still have a reasonably up-to-date backup.
Remember that a differential backup is not an incremental backup. If you do a differential backup every night, it will be bigger than it was the previous night.
Of course I’d recommend using backup compression, which has been a standard feature since SQL Server 2008 R2 in Standard Edition, and before that in Enterprise Edition. Compressed database backups take less time to back up and restore, but at a cost of higher CPU utilisation. I think this trade-off is acceptable.
I’d also seriously look at having the backups go offsite to Azure Blob Storage or Amazon S3. If you use Amazon Glacier, keep in mind that there is a built-in delay of four hours before the first file comes back from Glacier.
So, to summarise, if the building has burned down:
- Do we have electricity?
- Do we have a network?
- Do we have a Storage Area Network (SAN)?
- Do we have a server?
- Do we have software?
- Do we have Internet access?
- Do we have a local copy of the latest backups?
- Can we get a local copy of the latest backups on site within that 24-hour window?
- Can we restore the backups and verify them (DBCC CHECKDB) within that 24-hour window?
Sometimes, if it’s a real disaster, the best thing might be to spin up a virtual machine in Azure, Google Compute, or Amazon Web Services, and restore the backup to there. If offsite backups are going to the cloud already, this might be the optimal disaster recovery plan in 2017. A new VM in Azure takes under half an hour to restore. Taking network speed into account, it might take three to six hours to restore the latest full and differential backup files, and then doing a DBCC CHECKDB to verify the database, which adds another two hours or so.
The attendee indicated that he would use SAN replication to bring the database back from the dead. My counter to that was that if it was a true disaster, there wouldn’t be a SAN at all.
If you have any thoughts about how you would tackle this scenario, find me on Twitter at @bornsql.