Does rebuilding my clustered index also rebuild my non-clustered indexes?

I’ve been working with SQL Server for many years now, and up until recently, I assumed that rebuilding any clustered index would cause non-clustered indexes to be rebuilt as well, because the non-clustered index includes the clustered index in it.

This assumption is wrong.

On SQL Server 2000, this only used to affect non-unique clustered indexes because the uniquifier might change. I was minding SQL Server 2000 instances for a long while after it was no longer supported by Microsoft, which is why this myth stuck around in my head for so long.

On SQL Server 2005 and higher, non-clustered indexes are not rebuilt when a clustered index is rebuilt.

As my friend Gail Shaw says in this forum post (from 2011, no less!):

It’ll have absolutely no effect on the nonclustered indexes. Nonclustered indexes use the clustered index key as a ‘pointer’ and that doesn’t change in a rebuild.

To summarize:

  1. A non-clustered index is rebuilt if the clustered index is dropped and recreated. Without a clustered index, the non-clustered indexes will have to refer to the row identifier (RID) in the underlying heap instead.
  2. A non-clustered index is not rebuilt if a clustered index is rebuilt, on SQL Server 2005 and higher.

Photo by 贝莉儿 NG on Unsplash

A trillion and one

Joe Obbish wrote an epic post a few weeks ago about loading a trillion rows into a non-partitioned table in SQL Server, using a clustered columnstore index to maximise the compression.

(Short version: it’s very slow to query. Don’t do it. Bad things happen. I have an ongoing investigation with Ewald Cress about the evil wait type involved, which Joe noted in his original post. None of what I do is original.)

So I decided to repeat his experiment as well, mainly to see if I could. There is no limit to the number of rows you can store in a table in SQL Server (except in terms of disk space), so I replicated Joe’s experiment and … well, it was practically the same.

Because I’m petty, I inserted one trillion and one rows into my table. I currently hold the record, I guess?

My particular instance was an Ubuntu 16.04 LTS virtual machine, running on ESXi 6.5. The VM had 40GB of RAM and eight CPU cores assigned, which meant I was really being unfair on my quad-core Xeon CPU. The version of SQL Server was, naturally, 2017 for Linux, and I was using the latest release candidate, RC2. Max Server Memory was 32GB, Cost Threshold for Parallelism was 50, and MAXDOP was 0.

Also, I disabled auto-create and auto-update for statistics. In retrospect, this was a mistake, but I live with my mistakes and learn from them, et cetera.

One observation I have is that Linux appears to be marginally quicker than Windows Server on the same hardware, particularly around disk I/O. I can’t give you reasonable numbers because my NVMe drive has zero latency. Literally. At that level, any number greater than zero is an outlier, and I haven’t had time to do a proper statistical analysis.

During the five days it took (I did the trillion row insert twice, and the second time was much quicker), the latency did climb up to 6.6 ms for log writes at one stage.

What this really tells me is nothing at all. A trillion inserts was CPU- and memory-bound because the high compression of the clustered columnstore index meant very few disk writes were actually performed (notwithstanding the transaction log). Like Joe, my table was less than 1GB in size.

On the other hand, I managed to get a sustained speed of 8.4 million inserts per second, which is quite impressive.

SQL Server is definitely capable of taking a lot of abuse.

If you have any million-million-row tables you want to talk about, let’s chat on Twitter at @bornsql.

Connect Windows 10 client to a VPN on Windows Server 2012 R2

I got a strange request in a Slack channel the other day. A colleague in South Africa, who uses Windows, was unable to connect to our VPN (Virtual Private Network).

We use the built-in VPN on Windows Server 2012 R2, which makes it extremely convenient to manage per-user security without opening up the firewall for the entire world to connect to the server with RDP (Remote Desktop Protocol).

The reason this is a strange request is that I have a MacBook Pro, and creating a VPN connection couldn’t have been easier for me. Another colleague also has an Apple laptop, so of course they’re not affected either.

After searching online, I discovered this inconspicuous post on Spiceworks.

The answer is to create a key in the Windows 10 client registry, that enables IPSec NAT Traversal (I know, right?).

The Spiceworks post links to Microsoft KB article 926179, which says to add the following key to your Windows 10 registry:

  • Branch: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\PolicyAgent
  • Key: AssumeUDPEncapsulationContextOnSendRule
  • Type: DWORD (32-bit)
  • Value: 2

According to the KB article, a value of 2 “configures Windows so that it can establish security associations when both the server and the […] client computer are behind NAT devices.

This same registry tweak works on Windows Vista. Given that it also works on a Windows 10 client, it seems safe to assume that it works on Windows 7 and 8.x. There is even a similar registry entry for Windows XP. Go figure.

Yet another problem solved by someone else. Thanks, Gareth4146.

Photo by Robert Hickerson on Unsplash

What you need to know about memory limits on SQL Server

Last week, I posted this statement on Twitter, along with a screen capture of the official Microsoft documentation:

Reminder: Max Server Memory is *not* just for the buffer pool. The RAM limit Standard Edition can use *is* just for the buffer pool.

Joey D’Antoni was quick to remind me that this only applies to SQL Server 2012 and higher, which I gratefully acknowledge. For the sake of this post, I am talking specifically about SQL Server 2012 and higher.

Let’s break this down.

On all editions of SQL Server (Enterprise, Standard, Web, Express), there is a configuration setting called Max Server Memory. This value should not be left at the default 2.1 petabytes, because at that value SQL Server will use all the memory you allow it (plus a little bit more if it needs to), and once it has it, it doesn’t release it back to the operating system without a fight.

What does Max Server memory control, though? According to the documentation, it’s the buffer pool, compilation memory, all the caches including the plan cache, and a bunch of other stuff. Here’s a quote:

Max Server Memory controls the SQL Server memory allocation, including the buffer pool, compile memory, all caches, qe memory grants, lock manager memory, and clr memory (essentially any memory clerk found in sys.dm_os_memory_clerks). Memory for thread stacks, memory heaps, linked server providers other than SQL Server, and any memory allocated by a non SQL Server DLL are not controlled by max server memory.

The main takeaway here is that Max Server Memory is not limited to the buffer pool. There’s a lot of other stuff going on too.

Now let’s consider the RAM limit of all non-Enterprise editions of SQL Server, as specified in the license agreement. At the time of this writing, using Standard Edition as an example, the license has a RAM limit of 128GB, which is specifically the maximum memory available for the buffer pool only. (This clarification in the documentation occurred after the release of SQL Server 2016 Service Pack 1 last year, but the license limit has referred to the buffer pool since at least SQL Server 2012.)

This means that in a server with Standard Edition and a lot of RAM, you could set the Max Server Memory configuration setting higher than 128GB comfortably, and SQL Server will use more than 128GB perfectly legally, because there’s a lot more to Max Server Memory than just the buffer pool.

So when I read in forums and elsewhere that the license limit of 128GB for Standard Edition is the highest you can set Max Server Memory, this is demonstrably wrong.

To make this more interesting, SQL Server 2016 and higher supports ColumnStore and In-Memory OLTP in non-Enterprise editions anyway, which means that you can easily consume all the physical RAM on a server, even if your buffer pool is using the legal limit.

Tell me your licensing story on Twitter at @bornsql.

Microsoft presents: MySQL and PostgreSQL?

For the longest time, MySQL has been the recommended database platform for blogs and other websites. It’s free. It’s available on every version of Linux, which is the most common web server platform. If you need something more complex but still free, there’s PostgreSQL.

But there’s a lot going on beyond that word “free”. Because although the MySQL or PostgreSQL licence might be free, there are hidden costs to running a website these days.

Apart from the security aspect, which alone should make us want to turn off our computers and walk away, we need to keep backups, and if something goes wrong, we need to be able to restore those backups. We need to test those backups. We need to keep the server patched. We need to keep the server running at all times.

If we run a hosting company, and have paying customers, we might be able to roll that administrative overhead into our hosting fee, but it’s hard to remain competitive and still make money in this game. So we stack our customers, and things slow down because it’s only cost effective to have hundreds of customers on a single server. It increases risk because so many more customers might be affected by even planned maintenance. If your customers are global, how do you plan downtime that suits all of them?

This Azure thing starts to look appealing. Perhaps an Azure SQL Database on the Basic tier for 5 DTUs at $6 a month (Canadian) would be able to bring the power of SQL Server to blogging engines without breaking the bank. Who can’t afford $6 a month in this day and age to host a website backed by the best and most secure RDBMS around?

I’d get acceptable DTU performance for the low traffic most hosted websites get and all the benefits, including SSL encryption, fully managed, point-in-time backups, plus the very latest features. No need to maintain a separate server, or set up maintenance jobs to keep the site and its contents safe. No need to worry about testing my backups through phpMyAdmin, which, let’s face it, isn’t as great as SQL Server Management Studio.

But we already know MySQL. It has been around for ages. Why switch to a Microsoft technology?

Microsoft seems to have asked themselves that same question, probably because they understand we’ve all built websites on MySQL. One of my favourite customers uses a LAMP stack (Linux, Apache, PHP and MySQL) for their site, and they are happy. Sure they could move to SQL Server, or Azure SQL Database, but then I’d have to rewrite a whole bunch of things. There’s no such thing as a drop-in replacement for any RDBMS, no matter what the marketing folks tell you. Well, unless you’re migrating from SQL Server to Azure SQL Database. And even then, there are challenges.

So now Microsoft has this thing where you can host your MySQL (or PostgreSQL) database on the exact same management framework as Azure SQL Database. It’s identical. The only difference is the database engine.

What does that mean? How about acceptable transactional throughput for the low traffic most hosted websites get, and all the benefits, including SSL encryption, fully managed, point-in-time backups, plus the very latest features? (Sound familiar?)

Instead of hosting a Linux-based, self-managed and maintained server with lots of moving parts, you can decouple your website and your database, and run the database on the same RDBMS you know and love (MySQL or PostgreSQL), with all the care and feeding taken care for you, starting at $20 a month and including 50 GB of space.

It’s good to have options. Fully patched. Fully maintained. Fully backed up. Fully secured.

This is what you can get, today, on Azure. The choice is yours:

  • Azure SQL Database, compatible with SQL Server;
  • Azure Cosmos DB, a global-scale eventually-consistent 99.99%-uptime guaranteed distributed database;
  • Azure Database for MySQL; or
  • Azure Database for PostgreSQL.

If your platform of choice isn’t available as a Platform-as-a-Service solution, you can still spin up a VM template that contains your Oracle and SAP HANA solutions.

The future is looking pretty rosy in Azure.

Talk to me on Twitter at @bornsql about your favourite database platform.

This is how I recovered from a catastrophic failure

I was fresh off the boat* from South Africa, working in a small computer store in an equally small town in Saskatchewan. Five days a week in winter, six days a week in summer, mainly removing malware from laptops, selling new computers (which is why I’m happy to recommend Asus and MSI to this day).

One of the side effects of having an MCSE in my back pocket back then was that I was occasionally pulled in to help the boss, Bill, with bigger jobs.

A notice went out to the entire city from the local electricity provider that they would be performing emergency maintenance on one section of the grid, and that all sensitive electronic devices should be shut down and unplugged from the wall, just to be safe.

A large client of ours, who had only recently signed with us, heeded this warning. They had unplugged every single desktop and laptop in the building. They were conscientious to a fault. The fault here being that they forgot about a Windows Small Business Server in a custom closet that had been specially built.

This server did not have any backups. I know this because the backup hardware was on backorder for them.

We discovered this after a panicked phone call to our shop in the morning after the power notice had gone out. No one could connect to their email (don’t you love how it’s always email?).

Bill and I got in his truck and off we drove, speculating that they had forgotten about the server. We arrived, and I beelined to the custom closet. The client’s most senior board member was there, as was the head of HR, who also happened to be head of IT. Small town Saskatchewan works that way.

After taking photographs of the server in situ, we unplugged and removed it from the closet and I removed the side cover.

Many of you reading this know the smell I’m about to describe. Ozone has a very distinctive odour.

I used a flashlight to look into the power supply unit, because that’s where the smell was coming from. It wasn’t pretty, but a burnt out PSU isn’t meant to be pretty.

I showed Bill what I saw, bearing in mind two high level folks from the client were watching our every move. I made sure to document it all while I went, mainly with photographs.

The inevitable question came from the big boss: “So how bad is it?”

Being seven years ago now, I don’t remember my exact wording, but I looked him in the eye and I said, “There are two words I never like to say together in the same sentence, when it comes to computers. Those words are ‘catastrophic failure’. We will have to take this back to the shop and pull it apart to figure out what happened, but I want you to understand that there is a significant chance that I will be unable to recover any data.”

In that moment I understood a little bit of how emergency room doctors feel when they give bad news. The client knew they would go out of business because of this data loss.

We reattached the side of the server, stuck it in Bill’s truck and went back to the office.

I cleared some room on my desk and set to work, documenting the removal of hardware. There were four drives in the server. Based on the smell alone, I believed that all four drives were irrecoverable, but I had to prove it. Insurance would cover hardware replacement for the client, but we don’t make assumptions about the lifeblood, which is their data.

I want to stop here and talk about dust. Saskatchewan is dusty because it’s flat and there’s a lot of wind and open space. How windy? I have had metal garden furniture flying off my deck.

The server interior was so dusty that the canned air I used on it wasn’t enough. I had to use a static-free cloth to wipe down the components I removed.

The first order of business was to clone the hard drives so that I could work off the clones to attempt data recovery. There was no way of knowing if they’d spin up, and I had no idea of knowing how they’d been configured. The server was not even starting up.

Once the drives were cloned, I replaced the power supply unit and plugged in the cloned drives to the motherboard, to see if I could at least establish the RAID configuration.

There was a warning that the RAID mirror was broken, which initially I thought was because I’d pulled the original drives, but the date didn’t match. The error was showing two months prior, long before we had been asked to look after this new client.

After acknowledging the error, I looked at the RAID configuration. Two separate RAID 1 mirrors, for a C: and D: drive. Well, that was some good news at least. If the drives were striped, my recovery options were close to zero. At least mirrored drives might have some opportunity to recover.

By now I had the cloned drives plugged in and configured in the same way as the originals, just updating the devices in the BIOS.

The server didn’t boot. No surprise there. I tried a few more combinations and nothing worked.

Then, I decided to try one of each drive, with no mirrors in place, and see what that did. After the third combination, the server started up.

Windows Small Business Server 2007. It said Safe Mode on it, but it was a running operating system.

And a login screen.

I logged in, expecting the worst. That’s exactly what I got. Files were missing or corrupt. SharePoint wasn’t working. Exchange Server wasn’t working. (Safe Mode, remember?) It was a mess.

Then I did what anyone would do in this situation (well, I’d do it again if I was in this situation). I turned it off, removed the cloned drives, and stuck the original drives back in, just with the new combination I’d discovered with the cloned drives.

It booted again. The drives spun. The Windows Safe Mode screen was still there, but it booted. I logged in, and stuff was there. It was there, but it was two months old.

Yup, I had managed to somehow get the degraded side of the RAID mirror working. Given the risks, I began a disk2vhd across the network, cloning the drive that way.

Once I had the data in a VHD, I did a full hardware and software check of the drives. The degraded mirror drives, the ones that failed two months prior, were fine. We had, effectively, a backup from two months prior.

There is a lot more to say about rebuilding an entire network because the Active Directory portion of the data was irrecoverable. And about how I extracted PSTs from the corrupt Exchange Server database. But that’s for a much later post. I still have flashbacks.

For the record, the new server and two weeks of my time to rebuild everything for the new client was not cheap. Amusingly, the backup hardware we had on backorder arrived during that first week.

I’m not a hero. I was lucky. The client treated me like a hero, but I was lucky. There was no good reason for this to have worked.

In my postmortem, I figured that the power supply unit had blown up, and taken the “good” side of each RAID mirror with it. I believe that the degraded drives were just lucky to not get nuked by the power spike (maybe the other drives took the brunt of it). There was still some obvious corruption though, which is probably what caused the mirror to degrade in the first place. Active Directory was gone, some email was gone (not to mention two months’ worth of new data).

Test your backups. Please.

  • It was a plane, and I was about three months in town already.

Locks, Blocks, and Isolation Levels

Last week we looked at ACID compliance. This week we dive a little deeper into the Isolation portion, and what it all means.

Isolation levels work to manage how my transactions may be isolated from your transactions, to allow concurrency. They work by locking sections of the database until a transaction is committed or rolled back, which results in blocking other access to those sections of the database.

In other words, locking and blocking is expected behaviour in a RDBMS.

There are four isolation levels in SQL Server (as quoted from SQL Server Books Online):

  • Read uncommitted (the lowest level where transactions are isolated only enough to ensure that physically corrupt data is not read)
  • Read committed (Database Engine default level)
  • Repeatable read
  • Serializable (the highest level, where transactions are completely isolated from one another)

The default on SQL Server (Read committed) means that readers block writers, and writers block readers. If I’m performing an UPDATE, DELETE or INSERT on a table when you’re trying to run a SELECT, and they happen to be trying to access the same set of 8 KB data pages that make up that table, whichever transaction started first (the one with a lower LSN) will take precedence, and the database engine will lock all the portions of the table that are being affected. Any other operation that also tries to access those portions of the table will be blocked until the transaction is complete. Locks happen thousands of times a second on a busy system, and most of the time we barely notice.

Deadlocks occur when two operations manage to lock each other out. SQL Server will kill a deadlock by terminating the process that was using fewer resources. This means that a long-running SELECT could easily take precedence over a short-running UPDATE. There’s no way of knowing who the deadlock victim will be, which is why our code should always have a retry mechanism built in. Always keep in mind that concurrency, by design, must assume locking, blocking, and deadlocks.

Whither NOLOCK?

We may have used the query hint WITH (NOLOCK) in the past because we noticed that queries seemed to run a lot faster. That’s because the NOLOCK keyword forces the lowest possible isolation level.

Referring back to the Books Online link above (or even the Wikipedia article), Read Uncommitted, or NOLOCK, permits the following bad behaviour:

  • dirty reads – a row that is uncommitted (either has not been committed yet, or may be rolled back) will end up in your query.
  • non-repeatable reads – a row might be read twice, with differing results, because the database engine might decide to delete and reinsert that row at the end of a table, for example.
  • phantom reads – a row that will eventually be deleted, can also show up in your results.

Consider the reliability of these uncommitted reads. Consider the negative impact of making business or financial decisions based on these results.

Don’t use NOLOCK on data that is changing all the time.

RCSI

There’s a fairly painless solution, though. We use NOLOCK in the first place because the default Read Committed isolation level causes readers and writers to block each other. A database-level setting, called Read Committed Snapshot Isolation, will use tempdb to keep track of older versions of committed data.

When a data modification (INSERT, UPDATE, DELETE) takes place, the database engine stores the last committed version of each affected row in an internal table on tempdb. When a SELECT comes along, the engine redirects it to see the last known committed version instead of having to wait for the data modification to complete.

Nothing is for free, though. There is a slight performance impact, relating to the disk requirements for tempdb, along with additional space. However, in all cases that I have implemented RCSI, the benefits far outweigh the costs.

You can read more about this feature in a post by Paul White for more details.

Find me on Twitter at @bornsql if you have any dirty read stories to share.

 

Acids and Databases: A look at ACID and transactions

Relational database management systems (RDBMS) such as SQL Server, Oracle, MySQL, and PostgreSQL use transactions to allow concurrent users to select, insert, update, and delete data without affecting everyone else.

An RDBMS is considered ACID-compliant if it can guarantee data integrity during transactions under the following conditions:

ACID
  • Atomic – transactions are all-or-nothing. If something goes wrong, the entire transaction should go back to a previously known state.

  • Consistent – the data has to follow the internal rules of the database structure, including (but not limited to) data types, foreign key constraints, and default constraints. When we see a failure due to this type of error, the database must revert back to a known state again.

  • Isolated – when I run an operation, it should not impact how you run an operation. My operations are isolated from yours. That does not mean that you and I cannot update the same column in the same table. If your transaction starts before mine, your change will go through first, and then mine will go through. Isolation does not prevent more than one change on the same data, nor is it designed to.

  • Durable – once the transaction has been committed, that data is now persisted, and any failure that occurs after the commit will not affect the durability of the operation.

Transaction Log

This is why the transaction log on SQL Server is so important. It doesn’t just keep track of successful changes in the database. Every single modification in the database must be recorded in the transaction log as it occurs, and each modification has to have a start and an end marker.

If there’s a failure, the rollback will “undo” all the logs to that point from the starting marker (called an LSN, or Log Sequence Number in SQL Server). Yes, that generates even more logs as it makes changes to the database to perform the rollback, until the database is in a state that is equivalent to the moment just before the transaction began.

There are cases where operations from the original query may cause changes to the physical database and file structure, which may not be undone exactly, but the rollback will ensure that the database is at least in a consistent state and that any data modifications performed during a failed transaction are reversed.

Crash Recovery

When SQL Server starts up, sometimes it can take a while for a database to come back online again. That’s because it goes through a process called crash recovery, because there’s no way for the service to know with 100% certainty that it started cleanly for each database.

Think of the transaction log as a replay button.

The transaction log is scanned for active portions which contain information about transactions that were not yet committed. If they contain an end marker, the transaction is rolled forward (it runs the transaction again and modifies the database so that it matches what’s in the transaction log). If there is no end marker, or the original transaction was rolled back, that transaction is rolled back again.

The same process happens when a database is restored from a backup because there may have been transactions in progress when the backup started, some of which may have ended by the time the backup completed.

Implicit vs. Explicit Commit

While some RDBMS like Oracle may require explicit commits (in other words, adding a COMMIT keyword to the end of an operation) to end a transaction, SQL Server has implicit commits by default (no COMMIT keyword is required). We have to be mindful of this when moving between certain platforms. On SQL Server, we need to be sure that the change we are making in a production environment is the right one, or at least test our changes beforehand.

Stay tuned next week when we discuss locking, blocking, isolation levels, and NOLOCK.

Find me on Twitter at @bornsql.

What is a good Disaster Recovery Plan?

During a Q&A session I hosted at our local Calgary SQL Server User Group last month, one of the attendees mentioned an interview question he’d had and wanted to hear my thoughts.

How would you design a disaster recovery solution for a 1 TB database, using Simple Recovery Model? Replication is not permitted, and switching to the Full Recovery Model is not permitted.

Reminder: Simple Recovery Model means no transaction log backups. Only full and differential backups are possible.

My first question to the attendee was what the Service Level Agreement (SLA) says. As we know from previous posts, a disaster recovery strategy is dictated by business requirements, not technical ones. The Recovery Point Objective (how much data loss is acceptable) and Recovery Time Objective (how much time there is to bring everything back) will guide my proposal.

He told me that the SLA was 24 hours, so I started writing on the white board while I was thinking aloud.

On average, a fast storage layer can read and write around 200 MB/s, so it would take 5.12 seconds to write 1 GB, or just under 85 minutes to restore the database back to disk, not counting the log file or crash recovery. I never assume that Instant File Initialization is enabled, plus I won’t know how big the transaction log file is going to be, and that needs to be zeroed out.

The first rule of designing a disaster recovery plan is to expect the worst. Imagine the building has burned down and there’s no Internet access. That means no new server, no operating system, no software installation media. It’s the worst it could be.

Since 85 minutes goes into 24 hours quite comfortably, I then started asking about network speed. Imagine that the backup file has to be copied over an Ethernet connection at (best case) gigabit speeds, which has a theoretical maximum of 125 MB/s. Given that “theoretical” word in there, I always take away 20% and I’m left with 100 MB/s. So assuming the backups are stored on a network device that can deliver our files at gigabit speeds, we’re looking at almost three hours to write the database files, at minimum.

Here we get into assumptions. We have to assume that new hardware can be provided within that 24 hour window. If I don’t have a server to install SQL Server onto, then I won’t be able to meet the SLA. If I don’t have Windows Server (or Linux!) to install onto the server, I won’t meet the SLA. If I cannot install the latest updates to Windows and SQL Server, I won’t meet the SLA.

Now we get into a more challenging aspect: how to take backups so that there’s the minimum amount of downtime and the maximum use of disk space for the backups.

With the Simple Recovery Model, we have our full database backups, which take the entire database and as much of the transaction log as is required to make the database transactionally consistent when restored. In other words, if there are any in-flight transactions that are committed by the time the backup finishes, those should roll forward during restore, and any that don’t commit by then are rolled back. SQL Server handles this for us during the backup itself, and again during the crash recovery phase, but a long-running transaction that has to roll back can delay startup, and I’ve seen cases where this can take several hours.

The other thing we can do with Simple Recovery Model, is differential backups. These are backups that keep track of all the extents in the database that were changed since the last full backup. In many cases, these differential backups will be significantly smaller than the full backup, so it becomes easier to keep a smaller set of files but still have a reasonably up-to-date backup.

Remember that a differential backup is not an incremental backup. If you do a differential backup every night, it will be bigger than it was the previous night.

Of course I’d recommend using backup compression, which has been a standard feature since SQL Server 2008 R2 in Standard Edition, and before that in Enterprise Edition. Compressed database backups take less time to back up and restore, but at a cost of higher CPU utilisation. I think this trade-off is acceptable.

I’d also seriously look at having the backups go offsite to Azure Blob Storage or Amazon S3. If you use Amazon Glacier, keep in mind that there is a built-in delay of four hours before the first file comes back from Glacier.

So, to summarise, if the building has burned down:

  • Do we have electricity?
  • Do we have a network?
  • Do we have a Storage Area Network (SAN)?
  • Do we have a server?
  • Do we have software?
  • Do we have Internet access?
  • Do we have a local copy of the latest backups?
  • Can we get a local copy of the latest backups on site within that 24-hour window?
  • Can we restore the backups and verify them (DBCC CHECKDB) within that 24-hour window?

Sometimes, if it’s a real disaster, the best thing might be to spin up a virtual machine in Azure, Google Compute, or Amazon Web Services, and restore the backup to there. If offsite backups are going to the cloud already, this might be the optimal disaster recovery plan in 2017. A new VM in Azure takes under half an hour to restore. Taking network speed into account, it might take three to six hours to restore the latest full and differential backup files, and then doing a DBCC CHECKDB to verify the database, which adds another two hours or so.

The attendee indicated that he would use SAN replication to bring the database back from the dead. My counter to that was that if it was a true disaster, there wouldn’t be a SAN at all.

If you have any thoughts about how you would tackle this scenario, find me on Twitter at @bornsql.