We hear a lot about “best practice”, but what does it mean in the context of SQL Server? Wikipedia describes best practice as follows:
A best practice is a method or technique that has been generally accepted as superior to any alternatives because it produces results that are superior to those achieved by other means or because it has become a standard way of doing things, e.g., a standard way of complying with legal or ethical requirements.
We will take a high-level look at high availability and disaster recovery, an important part of database management and administration and a good example of how a certain way of doing things has become a standard. Then we will discuss what it means to implement a best practice for disaster recovery.
High Availability
There are many ways to achieve high availability (HA), the idea that our database has zero (or close to zero) downtime, and the approach we choose is usually dictated by our budget. We want business continuity in the face of planned and unplanned downtime. This may take the form of database mirroring, Availability Groups, failover clustering, or log shipping.
Disaster Recovery
Disaster recovery (DR) is slightly different. It assumes that a disaster has occurred, our highly-available environment has gone down, and now the clock is ticking. Company losses are mounting by the minute.
How this affects me, the DBA
HA/DR requirements are usually expressed in a Service Level Agreement using two terms:
- RPO, or recovery point objective: the maximum amount of data loss allowed by the business.
- RTO, or recovery time objective: the maximum amount of time it should take to recover from a disaster.
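To make the two terms concrete, here is a minimal Python sketch that compares a backup schedule with the objectives in an SLA. It assumes that the worst-case data loss is roughly the interval between transaction log backups (that is, no tail-log backup can be taken after the failure) and that we have timed a real restore. The function names and the figures are illustrative only.

```python
from datetime import timedelta


def worst_case_data_loss(log_backup_interval: timedelta) -> timedelta:
    """Assumption: the server fails just before the next log backup runs,
    so everything since the previous log backup is lost."""
    return log_backup_interval


def meets_objectives(log_backup_interval: timedelta,
                     measured_restore_time: timedelta,
                     rpo: timedelta,
                     rto: timedelta) -> dict:
    """Compare the schedule and a timed restore against the SLA."""
    return {
        "rpo_met": worst_case_data_loss(log_backup_interval) <= rpo,
        "rto_met": measured_restore_time <= rto,
    }


# Illustrative numbers: log backups every 15 minutes, last test restore took 3 hours.
result = meets_objectives(
    log_backup_interval=timedelta(minutes=15),
    measured_restore_time=timedelta(hours=3),
    rpo=timedelta(minutes=30),
    rto=timedelta(hours=2),
)
print(result)
```

In this made-up case the schedule satisfies the RPO, but the restore takes longer than the RTO allows, which is something we would only discover by timing an actual restore.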
What we want is a reliable, predictable and repeatable disaster recovery plan, even if we have a high availability solution that has never failed before. If our Availability Group doesn’t fail over cleanly, or the drive array is on fire, or that Windows Update managed to kill the network interface, we need to roll up our sleeves and activate the DR plan.
So, according to the usual guidance, that means having regular full backups, differential backups, and transaction log backups ready to restore at a moment’s notice, right? Yes, but that’s not where our job stops.
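As a rough illustration of how those three backup types fit together at restore time, the Python sketch below picks the files needed to reach a target point in time: the latest full backup, the latest differential taken after it (if any), and then the log backups that follow. It is a simplified model that orders backups by finish time, whereas SQL Server itself tracks log sequence numbers. The `Backup` class and `restore_chain` function are hypothetical names, not part of any SQL Server tooling.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    kind: str            # "full", "diff", or "log"
    finished: datetime   # when the backup completed


def restore_chain(backups, target):
    """Return the backups to restore, in order, to reach `target`.

    Simplified model: uses finish times only, not log sequence numbers.
    """
    ordered = sorted(backups, key=lambda b: b.finished)
    fulls = [b for b in ordered if b.kind == "full" and b.finished <= target]
    if not fulls:
        raise ValueError("no full backup exists before the target time")
    full = fulls[-1]

    # Only the most recent differential after the chosen full backup is needed.
    diffs = [b for b in ordered
             if b.kind == "diff" and full.finished < b.finished <= target]
    base = diffs[-1] if diffs else full
    chain = [full] + ([base] if diffs else [])

    # Apply log backups after the base until one reaches the target time.
    for log in (b for b in ordered if b.kind == "log" and b.finished > base.finished):
        chain.append(log)
        if log.finished >= target:
            break  # this log contains the target; the restore would stop part-way through it
    return chain


backups = [
    Backup("full", datetime(2024, 3, 3, 0, 0)),
    Backup("diff", datetime(2024, 3, 4, 0, 0)),
    Backup("diff", datetime(2024, 3, 5, 0, 0)),
    Backup("log", datetime(2024, 3, 5, 1, 0)),
    Backup("log", datetime(2024, 3, 5, 2, 0)),
    Backup("log", datetime(2024, 3, 5, 3, 0)),
]
chain = restore_chain(backups, target=datetime(2024, 3, 5, 1, 30))
print([(b.kind, b.finished.isoformat()) for b in chain])
```

If no log backup reaches the target time, the chain simply ends at the last available log, and the gap between that log and the target is the data we lose.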
Putting our best foot forward
This is where best practice comes in. It’s all very well having a good disaster recovery plan, but it’s useless if we don’t test it. Best practice in this case would be having both a good backup strategy and a methodology to make sure our DR plan is tested often, so that when something goes wrong, we have a script to follow. That way we will be able to recover with a small RPO and a short RTO, so that business is impacted as little as possible.
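One small way to keep ourselves honest about testing is to record when each database was last restored successfully and flag anything that has gone too long without a test. The Python sketch below assumes a weekly testing cadence and uses hypothetical database names; `overdue_drills` is an illustrative helper, not an existing tool.

```python
from datetime import date, timedelta


def overdue_drills(last_tested, today, max_age=timedelta(days=7)):
    """Return the names of databases whose last successful restore test
    is older than `max_age`, or that have never been tested (None)."""
    return sorted(
        name for name, tested in last_tested.items()
        if tested is None or today - tested > max_age
    )


# Hypothetical record of the last successful test restore per database.
last_tested = {
    "Sales": date(2024, 3, 1),
    "HR": date(2024, 2, 1),
    "Archive": None,
}
overdue = overdue_drills(last_tested, today=date(2024, 3, 5))
print(overdue)
```

A list like this could sit alongside the restore steps, so that an untested database is noticed before a disaster rather than during one.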
We call this a Run Book. Anyone in our organisation should be able to follow the script and perform a successful recovery.
What if we are incapacitated in some way or, as happened to me recently, stuck on an aircraft with no access to the Internet? What if the only person who is able to restore the environment is the CEO after hearing that her building is on fire?
Our backups must not only be stored securely somewhere (in my post last week, I provided a budget-friendly solution for off-site backups), but we should also be able to restore them to a known point in time, within the recovery time objective, losing as little data as possible.
Our run book should be clear enough that whoever is testing the disaster recovery plan can follow it without having to ask us questions.
Earlier we saw the word “superior” repeated in the definition of best practice. A superior solution doesn’t need to be the most expensive one, using the latest and greatest tools. A run book could be an Excel file containing enough information to rebuild a server and locate magical scripts in the cloud that are tested weekly.
Know the problem we’re trying to solve. Build a solution that is practical and cost-effective for our environment. Test and iterate until it is flawless. That’s best practice.