So, like, what is a byte?

A friend of mine in the filmmaking business, who is exceedingly bright but has never worked with SQL Server before, was reading through the first five posts of this Database Fundamentals series, and asked a great question:

“I guess I’m not understanding what a byte is. I think I’m circling the drain in understanding it, but not floating down.”

She has a way with words.

I answered her immediately, but it reminded me that I did get a little carried away with data types, assuming that everyone reading that post would understand what a byte is.

In the innards of the computer is the CPU, or Central Processing Unit (there might be more than one in a server). The CPU is best described as a hot mess of on-off switches. Just like the light switches in your house, each switch has only two states.


This is what “binary” means. When the CPU clock ticks over, billions of times per second, if a switch is closed, it’s a 1 (electricity can flow to complete the circuit). If the switch is open, it’s a 0 (electricity cannot pass through it).

(Image: a simple electric circuit. Source: https://diytechpro.com/electric-circuit-simple-concept/)

The CPU (along with memory, the storage system, and the network) understands binary, and the software that sits on top of it uses binary as well.

We end up with a series of 1s and 0s that, when arranged in different combinations, represent information in some form or another. Each of these is a binary digit, or bit.

Thanks to a series of decisions in the early days of computing, when we stick eight of these bits together, they form a byte.

Now comes the mathematical part of today’s post.

If we have 8 digits that can store two values each, we get a total of 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 combinations. This is more easily typed as 2^8, or 256. In other words, a byte can store a maximum of 256 values.
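
If you want to see this mapping for yourself, SQL Server exposes it through the ASCII() and CHAR() functions. Here's a quick sketch:

-- ASCII() returns the decimal value stored in a single byte for a character,
-- and CHAR() converts a decimal value (0 to 255) back into a character.
SELECT ASCII('A') AS DecimalValue,   -- 65
       CHAR(65)   AS CharacterValue; -- A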

Here’s a short list of bytes to give you an example (I have not listed every one of the 256 possibilities). We write the bits in groups of four to make them easier to read.

Binary Decimal ASCII Binary Decimal ASCII
0010 0000 32 <space> 1000 0001 129 Å
0010 0001 33 ! 1000 0010 130 Ç
0010 0010 34 " 1000 0011 131 É
0010 0011 35 # 1000 0100 132 Ñ
0010 0100 36 $ 1000 0101 133 Ö
0010 0101 37 % 1000 0110 134 Ü
0010 0110 38 & 1000 0111 135 á
0010 0111 39 ' 1000 1000 136 à
0010 1000 40 ( 1000 1001 137 â
0010 1001 41 ) 1000 1010 138 ä
0010 1010 42 * 1000 1011 139 ã
0010 1011 43 + 1000 1100 140 å
0010 1100 44 , 1000 1101 141 ç
0010 1101 45 - 1000 1110 142 é
0010 1110 46 . 1000 1111 143 è
0010 1111 47 / 1001 0000 144 ê
0011 0000 48 0 1001 0001 145 ë
0011 0001 49 1 1001 0010 146 í
0011 0010 50 2 1001 0011 147 ì
0011 0011 51 3 1001 0100 148 î
0011 0100 52 4 1001 0101 149 ï
0011 0101 53 5 1001 0110 150 ñ
0011 0110 54 6 1001 0111 151 ó
0011 0111 55 7 1001 1000 152 ò
0011 1000 56 8 1001 1001 153 ô
0011 1001 57 9 1001 1010 154 ö
0011 1010 58 : 1001 1011 155 õ
0011 1011 59 ; 1001 1100 156 ú
0011 1100 60 < 1001 1101 157 ù
0011 1101 61 = 1001 1110 158 û
0011 1110 62 > 1001 1111 159 ü
0011 1111 63 ? 1010 0000 160
0100 0000 64 @ 1010 0001 161 °
0100 0001 65 A 1010 0010 162 ¢
0100 0010 66 B 1010 0011 163 £
0100 0011 67 C 1010 0100 164 §
0100 0100 68 D 1010 0101 165
0100 0101 69 E 1010 0110 166
0100 0110 70 F 1010 0111 167 ß
0100 0111 71 G 1010 1000 168 ®
0100 1000 72 H 1010 1001 169 ©
0100 1001 73 I 1010 1010 170
0100 1010 74 J 1010 1011 171 ´
0100 1011 75 K 1010 1100 172 ¨
0100 1100 76 L 1010 1101 173
0100 1101 77 M 1010 1110 174 Æ
0100 1110 78 N 1010 1111 175 Ø
0100 1111 79 O 1011 0000 176
0101 0000 80 P 1011 0001 177 ±
0101 0001 81 Q 1011 0010 178
0101 0010 82 R 1011 0011 179
0101 0011 83 S 1011 0100 180 ¥
0101 0100 84 T 1011 0101 181 µ
0101 0101 85 U 1011 0110 182
0101 0110 86 V 1011 0111 183
0101 0111 87 W 1011 1000 184
0101 1000 88 X 1011 1001 185 π
0101 1001 89 Y 1011 1010 186
0101 1010 90 Z 1011 1011 187 ª
0101 1011 91 [ 1011 1100 188 º
0101 1100 92 \ 1011 1101 189 Ω
0101 1101 93 ] 1011 1110 190 æ
0101 1110 94 ^ 1011 1111 191 ø
0101 1111 95 _ 1100 0000 192 ¿
0110 0000 96 ` 1100 0001 193 ¡
0110 0001 97 a 1100 0010 194 ¬
0110 0010 98 b 1100 0011 195
0110 0011 99 c 1100 0100 196 ƒ
0110 0100 100 d 1100 0101 197
0110 0101 101 e 1100 0110 198
0110 0110 102 f 1100 0111 199 «
0110 0111 103 g 1100 1000 200 »
0110 1000 104 h 1100 1001 201
0110 1001 105 i 1100 1010 202
0110 1010 106 j 1100 1011 203 À
0110 1011 107 k 1100 1100 204 Ã
0110 1100 108 l 1100 1101 205 Õ
0110 1101 109 m 1100 1110 206 Œ
0110 1110 110 n 1100 1111 207 œ
0110 1111 111 o 1101 0000 208
0111 0000 112 p 1101 0001 209
0111 0001 113 q 1101 0010 210
0111 0010 114 r 1101 0011 211
0111 0011 115 s 1101 0100 212
0111 0100 116 t 1101 0101 213
0111 0101 117 u 1101 0110 214 ÷
0111 0110 118 v 1101 0111 215
0111 0111 119 w 1101 1000 216 ÿ
0111 1000 120 x 1101 1001 217 Ÿ
0111 1001 121 y 1101 1010 218
0111 1010 122 z 1101 1011 219
0111 1011 123 { 1101 1100 220
0111 1100 124 | 1101 1101 221
0111 1101 125 } 1101 1110 222
0111 1110 126 ~ 1101 1111 223
0111 1111 127 1110 0000 224
1000 0000 128 Ä 1110 0001 225 ·

There are values missing from the above table, for characters that cannot be displayed correctly in a web browser. For a complete table showing all 256 characters, visit PCGuide.com.

How does this affect Unicode values? If you recall from our post about CHAR, NCHAR, VARCHAR and NVARCHAR data types, the Unicode versions (those types starting with N) use two bytes in memory and on disk to store a single character, compared to the non-Unicode (sometimes called ASCII or plain text) data types, which use only one byte per character.
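
You can see this difference for yourself with the DATALENGTH() function, which returns the number of bytes a value occupies. A quick sketch:

SELECT DATALENGTH('a')  AS NonUnicodeBytes, -- 1 byte  (CHAR/VARCHAR)
       DATALENGTH(N'a') AS UnicodeBytes;    -- 2 bytes (NCHAR/NVARCHAR)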

The high-level reason for this is that some alphabets have more than 256 characters, so the code page (the full set of characters in upper- and lower-case where applicable, plus all the numbers, punctuation marks, and so forth) won’t fit in the 256 possibilities available in a single byte.

When we stick two bytes together, however, we suddenly have 2^16, or 65,536, possible values. This is mostly good enough, unless you're storing Japanese in SQL Server.

There are exceptions to this, where some kanji fall outside those 65,536 values and take up four bytes per character (UTF-16 stores them as surrogate pairs, while the fixed-width UTF-32 encoding uses 32 bits for every character). The good news is that SQL Server does support these wider supplementary characters, as long as we pick the correct collation.
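
As a hedged example (this assumes a database whose collation supports supplementary characters, in other words an _SC collation), you can see one of these wider characters taking up four bytes:

-- NCHAR() only accepts code points above 0xFFFF when the database collation
-- is a supplementary character (_SC) collation.
SELECT NCHAR(0x20B9F)             AS SupplementaryCharacter, -- a kanji outside the first 65,536 code points
       DATALENGTH(NCHAR(0x20B9F)) AS BytesUsed;              -- 4 bytes (a UTF-16 surrogate pair)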

I hope this answers any burning questions you may have had about bits and bytes.

Feel free to reach out to me on Twitter at @bornsql.

Max Server Memory and SQL Server 2016 Service Pack 1

Everything changed for SQL Server Standard Edition on 16 November 2016, including how its memory limits work.

On that day, a slew of Enterprise Edition features made their way into editions across the board, including Express Edition and LocalDB.

For Standard Edition, the memory limit of 128GB RAM applies only to the buffer pool (the 8KB data pages that are read from disk into memory — in other words, the database itself).

For servers containing more than 128GB of physical RAM, and running SQL Server 2016 with Service Pack 1 or higher, we now have options.

The max server memory setting has always referred only to the buffer pool, but for many reasons a lot of people misunderstood it to include other caches as well.

Because ColumnStore and In-Memory OLTP have their own cache limits over and above the 128GB buffer pool limit, the guidance around assigning max server memory is no longer simple.

ColumnStore now gets an extra 32GB of RAM per instance, while In-Memory OLTP gets an extra 32GB of RAM per database.

With that in mind, you are still welcome to use the Max Server Memory Matrix and associated calculator script for lower versions of SQL Server (up to and including 2014), but I will not be maintaining it further, unless someone finds a bug.

How much should I assign to max server memory? It depends.

It would be very easy to spec a server with 256GB RAM, install a single instance of SQL Server 2016 Standard Edition (with Service Pack 1, of course), have 128GB for the buffer pool, 32GB for the ColumnStore cache, three databases with 32GB of RAM each for In-Memory OLTP, and still run out of memory.
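
Whatever value you settle on, the setting itself is still changed with sp_configure. A quick sketch (the 131072MB value, which is 128GB, is just an example, not a recommendation):

EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
GO

-- max server memory is specified in megabytes: 131072 MB = 128 GB
EXEC sys.sp_configure N'max server memory (MB)', 131072;
RECONFIGURE;
GO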

This is a brave new world of memory management. Tread carefully.

If you’d like to share your thoughts, find me on Twitter at @bornsql.

Temporal Tables and Hidden Period Columns

In my November 2015 post, An Introduction to Temporal Tables in SQL Server 2016 using a DeLorean, I wrote:

The HIDDEN property is optional and will hide these columns from a standard SELECT statement for backward compatibility with our application and queries. You cannot apply the HIDDEN property to an existing column.

It turns out that this is no longer true. You can apply the HIDDEN property to an existing period column.

Let’s assume you have a temporal table containing two visible period columns, StartDate and EndDate, which you’d like to hide from a typical SELECT statement.

Using an ALTER TABLE ... ALTER COLUMN statement, simply place the ADD HIDDEN syntax after the period column name(s).

ALTER TABLE [dbo].[Account] ALTER COLUMN [StartDate] ADD HIDDEN;
ALTER TABLE [dbo].[Account] ALTER COLUMN [EndDate] ADD HIDDEN;

You can also remove this flag if you wish, using DROP HIDDEN:

ALTER TABLE [dbo].[Account] ALTER COLUMN [StartDate] DROP HIDDEN;
ALTER TABLE [dbo].[Account] ALTER COLUMN [EndDate] DROP HIDDEN;
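
To confirm that the flag took effect, you can check the is_hidden column in sys.columns (assuming the same dbo.Account table as above):

SELECT [name], [is_hidden]
FROM sys.columns
WHERE [object_id] = OBJECT_ID(N'dbo.Account');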

This is a great improvement to an already fantastic feature of SQL Server 2016. Thanks to Borko Novakovic for this tip.

If you have any more temporal table tricks you want to share, find me on Twitter at @bornsql.

My surname is NULL

Last Wednesday on Twitter, Abayomi Obawomiye (@SQLAmerica) wrote:

I just met someone with the last name NULL today. I really want to ask if they had issues with the last name but worried might be offensive

Abayomi goes on to say that the word “NULL” gets replaced with a space in any CSV files generated for reports, which got me thinking about this problem.

Then I realised I had already written about it. We make assumptions on a daily basis when it comes to working with data. In this case, Abayomi assumed that no one’s name could ever be “Null”.

Yes, I’ve made that assumption too.

This sparked a memory in the deep and rusty recesses of my brain. I remember reading an article about someone whose last name is Null. It took me just a few seconds to find the article on Wired Magazine’s website, and the pullout phrase is right in the headline:

Hello, I’m Mr. Null. My Name Makes Me Invisible to Computers

It turns out that Abayomi is not alone with this chance meeting. How many of us consider the possibility that legitimate data in our databases could match reserved keywords?

The author of that article, Christopher Null, has an amusing workaround:

My usual trick is to simply add a period to my name: “Null.” This not only gets around many “null” error blocks, it also adds a sense of finality to my birthright.

My goal is to build systems that are easier to use and that adapt to the consumer, so this is clearly problematic. We shouldn’t be forcing Christopher to change his own name to fit our database model.

How would I go about dealing with this problem? Clearly, when exporting data to CSV or generating extracts for third-party use, this problem has to be solved.

Delimiters are Good

The easiest way I can think of is for string values to be delimited correctly. When exporting to CSV format, we would ensure that all strings are surrounded by quotation marks. This way, Christopher Null’s record would show up as follows (assuming last name, first name ordering):

"Null","Christopher"

Then the parsing process on the reporting side would import that into whichever system without stumbling over the value.
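
As a rough sketch of how that might look in the export query itself (the dbo.Person table and its columns are hypothetical), QUOTENAME() can wrap each value in double quotation marks:

-- Wrap each string value in double quotation marks so that a surname like
-- Null survives the trip to CSV intact.
SELECT QUOTENAME([LastName], '"') + N',' + QUOTENAME([FirstName], '"') AS CsvRow
FROM [dbo].[Person];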

Another issue raised by Christopher in his Wired article is that his email address is rejected outright by Bank of America because it contains the word null before the @ sign.

First and Last Name columns are bad

I’m going to call myself out on the above example and say that assuming names can fit neatly into two columns is bad. Firstly, Cher and Madonna would complain.

Secondly, Vice Admiral Horatio Nelson, 1st Viscount Nelson, 1st Duke of Bronté, would run out of space before his name could be entered correctly.

This doesn’t even begin to address names that are outside of our mainly Western perspective. In South Africa, I saw names that few developers in North America would consider, making use of all manner of punctuation (including spaces).

My recommendation here is to have a 250-character NVARCHAR column called FullName and leave it at that. If you really want your mail merge software to send “personalised” emails or letters, add an additional PreferredName column, and make sure it’s properly delimited when exporting.

Christopher Null wouldn’t have to bat an eyelid here. It would solve his problem.
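
A minimal sketch of that table design (the table and column names here are made up for illustration):

CREATE TABLE [dbo].[Person] (
    [PersonId] INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [FullName] NVARCHAR(250) NOT NULL,
    [PreferredName] NVARCHAR(250) NULL
);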

(While I’m on this matter, don’t ask for Gender or Sex. You don’t need it. If you do, you’ll know how to ask properly.)

Email Validation is Stupid

My third rule for ensuring that Nulls can exist in this world with Wests and Obawomiyes is to stop validating email addresses!

Firstly, most developers use a regular expression to do it, which is wrong and terrible and slow, and (clearly) imperfect.

This is a particular bugbear of mine, since I have one character before the @ in my personal email address. Many systems fail when trying to validate it, including WestJet. My own workaround is to create a westjet@ alias so that I can receive email from WestJet.

The best way to validate an email address is to send an email to that address with a link for someone to click on. If they don’t click on the link to close the loop, the email address isn’t valid, and your well-designed system won’t allow them to continue to the next step unless that link is clicked.

Summary

Three simple recommendations could help the Nulls find visibility again and ensure data inserted into a system is not modified unnecessarily.

For more reading on this and similar problems, I recommend visiting Bobby Tables.

If you have any experience in this area, please feel free to contact me on Twitter, at @bornsql .

Temporal Tables and History Retention

I’m a huge fan of Temporal Tables in SQL Server 2016. I first wrote about them in a four-part series in November 2015, before SQL Server 2016 was even released. I don’t always get this excited about new features.

However, it has some limitations. As part of this week’s T-SQL Tuesday, hosted by the attractive and humble Brent Ozar, I have discovered a Microsoft Connect item I feel very strongly about.

Adam Machanic, the creator of an indispensable tool, sp_WhoIsActive, has created a Connect item entitled Temporal Tables: Improve History Retention of Dropped Columns.

As my readers know, a temporal history table has to have the same schema as its base table (the number and order of columns, and their respective data types, have to match).

Where this breaks down is when the structure of the base table changes. The history table also needs to take those changes into account, which could potentially result in data loss (when a dropped column disappears from the history table too) or redundant columns being kept in the base table just to preserve history.

Adam suggests allowing columns which no longer appear in the base table to be retained in the history table and marked as nullable (or hidden), so that they only appear in a point-in-time query when the column(s) are referenced explicitly.

I have voted for this suggestion, and at the time of writing, it has 16 upvotes. I encourage you to add your voice to this suggestion.

If you have any other suggestions, or wish to discuss temporal tables, please contact me on Twitter at @bornsql .

Wait For Service Pack 1

Conventional wisdom tells us that when Microsoft releases a new version of any server product, we should wait until Service Pack 1 before deploying it to production.

This hasn’t been true for a while now, since Microsoft recommended that Cumulative Updates for SQL Server carry the same confidence:

SQL Server CUs are certified to the same levels as Service Packs, and should be installed at the same level of confidence.

However, Service Pack 1 itself has been mired in some controversy. Microsoft didn’t make things any easier for us with SQL Server 2012, or 2014. Both versions had issues with their respective Service Pack 1.

Fortunately, SQL Server 2016 has broken the cycle, and along with all of the fixes in Cumulative Updates 1–3, and a security fix, we get better feature parity between Enterprise Edition and lower editions, including Standard, Web, Express and LocalDB.

There are some restrictions, of course, but the idea is that developers and vendors can write T-SQL for features that now appear across the board.

SQL Server 2016 Service Pack 1 now includes the following features for Enterprise, Standard, and Web Edition:

  • In-Memory OLTP
  • In-Memory Columnstore
  • Always Encrypted
  • Partitioning
  • Data Compression
  • Change data capture
  • And more!

If you want to take advantage of these features but you use an older version of SQL Server, you will need to upgrade to SQL Server 2016 with Service Pack 1, which I think is a no-brainer.
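
A quick way to check which version, service pack level, and edition you are currently running is with SERVERPROPERTY():

SELECT SERVERPROPERTY('ProductVersion') AS ProductVersion, -- e.g. 13.0.4001.0 for 2016 SP1
       SERVERPROPERTY('ProductLevel')   AS ProductLevel,   -- e.g. SP1
       SERVERPROPERTY('Edition')        AS Edition;        -- e.g. Standard Edition (64-bit)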

The good news is that licence prices have stayed the same since SQL Server 2012, and Standard Edition costs almost a quarter of the price of Enterprise Edition.

I maintain that this is the most exciting release of SQL Server since 2005. If you want to upgrade, contact us and we’ll help. We will even upgrade you from SQL Server 6.5.

Temporal Tables in Azure SQL Database

In the latest Microsoft Azure newsletter, which I received last week, was this most excellent news:

Azure SQL Database Temporal Tables generally available

Temporal Tables let customers track the full history of data changes in Azure SQL Database without custom coding. Customers can focus data analysis on a specific point in time and use a declarative cleanup policy to control retention of historical data. Designed to improve productivity when customers develop applications, Temporal Tables can help:

– Support data auditing in applications.
– Analyze trends or detect anomalies over time.
– Implement slowly changing dimension patterns.
– Perform fine-grained row repairs in cases of accidental data errors made by humans or applications.

For more information on how to integrate Temporal Tables in an application, please visit the Getting Started with Temporal Tables in Azure SQL Database documentation webpage. To use temporal retention, please visit the Manage temporal history with retention policy documentation webpage.

Long-time readers of my blog will remember my short series about Temporal Tables in SQL Server 2016. Now it’s time to play with them on Azure SQL Database too!
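
Here is a hedged sketch of creating a system-versioned table with the retention policy mentioned above (the dbo.Orders table and the six-month retention period are made up for illustration):

CREATE TABLE [dbo].[Orders] (
    [OrderId] INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    [Total] MONEY NOT NULL,
    [ValidFrom] DATETIME2 GENERATED ALWAYS AS ROW START HIDDEN NOT NULL,
    [ValidTo] DATETIME2 GENERATED ALWAYS AS ROW END HIDDEN NOT NULL,
    PERIOD FOR SYSTEM_TIME ([ValidFrom], [ValidTo])
)
WITH (SYSTEM_VERSIONING = ON (
    HISTORY_TABLE = [dbo].[OrdersHistory],
    HISTORY_RETENTION_PERIOD = 6 MONTHS
));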

Changes to Table Schema

With Azure SQL Database, just like SQL Server 2016, you can change the table schema without breaking the link to the history table. From the above Getting Started link, it states that you perform standard ALTER TABLE statements, “and Azure SQL Database will appropriately propagate changes to the history table”. It’s good to see feature parity across products like this. Temporal tables even work on Basic-sized databases.

Go forth and play with Temporal Tables. You’ll no longer audit data changes the same way. Get the benefits of Change Data Capture without the need for massive complexity.

Share your CDC horror stories with me on Twitter, at @bornsql.

On clustered indexes, clustering keys and primary keys

Many smart people have spoken about clustering keys and primary keys before, so here’s a less smart take on the topic.

Let’s define some things first:

Clustered Index – the column (or columns) by which the data in the table will be logically sorted. In other words, barring some exceptions, if you look at the data pages on disk, they will be in the order of the clustered index. A clustered index is the data itself. Can also be called a Clustering Key.

Heap – a table that is unsorted, i.e., it does not contain a clustered index.

Non-Clustered Index – contains a copy of a column (or columns) that makes use of the clustered index to map back to the data in a table.

A non-clustered index will always contain a copy of the clustered index as well (which is why a clustered index should be narrow).

If a table does not contain a clustered index (a heap), a non-unique non-clustered index will contain a copy of a Row Identifier (known as the RID), which is an under-the-covers way for SQL Server to uniquely identify a row.

Primary Key – a unique column (or columns), most often used in table joins, to uniquely identify individual rows. A primary key does not have to be a clustering key (but if you use the table designer in Management Studio, it becomes a clustering key by default — this is where a lot of confusion stems from).

Identity Key – a column that increments automatically when a new row is inserted. Can be part of a primary key and a clustered index.

Does my clustered index have to be unique?

Clustered and Non-Clustered Indexes do not have to be unique, but it helps. If a clustered index is not unique, an extra 4-byte “uniquifier” is appended under the covers so that SQL Server knows which row it refers to. Simply speaking, if you’re going to put a clustered index on a table, do it on a unique column (or combination of columns).

Does my table need a clustered index or primary key?

Unless you’re dealing with staging data (and even then I’d argue for a clustered index), all tables should have a primary key and a clustered index.

How do I choose a primary key?

I cannot answer this question for you. There are two main schools of thought:

  • Natural key – the data defines the key. In other words, you have a column in your table that is guaranteed to be unique. Social Security Numbers are not guaranteed to be unique. Neither are telephone numbers, nor MAC addresses. Don’t use these.
  • Surrogate key – you add a column up front, usually an integer (or a BIGINT if there will be more than 2 billion rows in the table), which is automatically incremented (identity key).

My preference is always for a surrogate key, but this is my decision, and there are knock-on effects that I need to consider as a result of making this choice.

Choosing an appropriate primary key is an exercise for the reader. You can start by reading Grant Fritchey’s recent post entitled “The Clustered Index Is Vital To Your Database Design”.

What does Microsoft say?

Best practice dictates that a table should have a clustered index on it, preferably unique and narrow, with an ever-increasing value.

This type of clustered index helps with reads, updates, and deletes, which are usually the activities a table will see. Insert-heavy tables may have a performance hit on a clustered index because it has to physically sort the data in the right place, but on modern storage subsystems, this is very low overhead.

If you decide to create a clustered index with more than one column, keep in mind that every non-clustered index will contain a copy of the clustered index, so you want to keep this as narrow as possible.

So why does my primary key have a clustered index on it?

For a number of reasons, Microsoft decided that when you create a primary key through the table designer in Management Studio (and in tools even before SQL Server 2005), it automatically makes that primary key a clustered index.

I’m not saying you should blame Microsoft, because in the vast majority of cases, this is perfectly acceptable. But it can make things bad when the wrong data type is used.

Let’s say that you use a randomly-generated value for a primary key. The usual example is a GUID, which is 16 bytes wide and random (compared to the narrow 4-byte integer).

If you create a table in the table designer, and make this GUID column a primary key, SQL Server will automatically make that column a clustered index in the background.

For a clustered index, this is the worst possible data type, so we should choose another column to be the clustered index. We can do that in the designer, but it’s not as simple as clicking the yellow key icon.

This is why more experienced SQL Server DBAs and developers tend to script out tables, adding the clustered index to the script manually, on a different column.

In other words, you can have a primary key without a clustered index (which is the same thing as having a unique non-clustered index on a table), and have the clustered index on a more appropriate column for storing the data.
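
Here is a brief sketch of how such a script might look (the table and column names are made up for illustration):

-- The GUID primary key is declared NONCLUSTERED, and the clustered index
-- goes on a narrow, ever-increasing integer column instead.
CREATE TABLE [dbo].[Customer] (
    [CustomerGuid] UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT [PK_Customer] PRIMARY KEY NONCLUSTERED
        CONSTRAINT [DF_Customer_CustomerGuid] DEFAULT NEWID(),
    [CustomerId] INT IDENTITY(1,1) NOT NULL,
    [CustomerName] NVARCHAR(250) NOT NULL
);

CREATE UNIQUE CLUSTERED INDEX [CX_Customer_CustomerId]
    ON [dbo].[Customer] ([CustomerId]);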

If you have any questions or comments, please chat to me on Twitter at @bornsql .

What is physical storage anyway?

My friend Jason asked me about his MySQL database export file last week, whether it was sufficient to create a new database. I replied saying that he would have to set up the database separately, because the physical files are allocated in a particular way depending on the web host.

What is a physical file, though? Kenneth Fisher, that crazy cat from SQL Studies, reminded me on Twitter that a physical file, or table, or database, isn’t actually physical. Hard drives are mostly virtualised, especially with web servers, and nothing is physical anymore. Even more confusingly, hard drives are gradually being replaced by electronic circuits and solid-state memory technology that does something called wear-levelling.

When we speak of physical files, or databases, or tables, we’re referring to the way these files are allocated on a storage device. In the old days, pre-2000, it was possible on some operating systems and file systems to allocate the fastest part of the spinning platter of a hard drive (the inside tracks) to processes that needed low seek times, like Relational Database Management Systems (RDBMS).

Nowadays with virtualised storage, including network storage and cloud storage, the physical aspect is meaningless. Spindles are what matter. But what’s a spindle? Until recently, storage was provided by spinning electromagnetic platters called hard drives. The spindle in the center is what we refer to, and the more drives allocated to a storage device, the better.

The problem with spinning drives is that they have a built-in delay for the read-head to locate the correct area on the platter surface, which is referred to as latency. Adding more spindles (more drives) to an array lowers the latency, because files are striped across multiple disks (RAID).

For example, in a three-disk configuration, one file could effectively be split into three parts. That improved read times, and therefore latency, because the file system could look for different parts of the file on different underlying platters, effectively reducing seek times by one-third.

To complicate things even more, solid-state storage, which comes in many forms but most commonly as USB thumb drives, camera cards (often called memory cards, which is a misnomer), and solid-state drives (SSDs), is gradually replacing hard drives in devices and in network and cloud storage. SSDs have virtually no latency because there’s no physical platter to spin up and no read head to position over a particular spot on the surface. The cells that contain the bits of data are accessible almost instantly.

However, this speed improvement has a trade-off: solid-state storage has a limited lifespan. The cells can only be written to a certain number of times before failing. The drive controller has to constantly move data around, a process called wear-levelling, to ensure that the blocks of storage are used evenly, thereby extending the life of the drive. Where a file is physically located on the drive can change at any given moment.

Virtual storage in cloud and network storage devices already muddied the waters with sharing data between virtual machines. Solid-state drives make it even more abstract.

A physical database might refer to the server it’s running on or a virtual machine that is running on a physical server. A physical file might refer to a collection of bits on a single spinning disk drive or SSD, or a collection of bits that are located on several hundred spinning disks.

We have reached a point where a physical storage paradigm is meaningless. When we talk about the physical file on a disk, we mean the abstract representation of the file system’s knowledge of where that file is stored at that point in time. Ultimately, though, it doesn’t mean what it used to.

If you would like to commiserate over the loss of high-failure spinning rust, find me on Twitter at @bornsql .

Is DBCC SHRINKFILE (filename, EMPTYFILE) fully logged?

Update: Paul Randal told me I had made a mistake in my post. The correction is below.


On Wednesday last week, Kenneth Fisher (b | t) asked: “Is DBCC SHRINKFILE (filename, EMPTYFILE) fully logged?”

Let’s buckle down and see for ourselves, shall we?

Firstly, what does DBCC SHRINKFILE do?

As we know, SQL Server allows us to reclaim space in data and log files by issuing a DBCC SHRINKFILE command. Books Online goes into some detail about each of the parameters we can use but does not seem to answer Kenneth’s question.

What does the EMPTYFILE parameter do, anyway?

Per Books Online:

EMPTYFILE

Migrates all data from the specified file to other files in the same filegroup. Because the Database Engine no longer allows data to be placed in the empty file, the file can be removed by using the ALTER DATABASE statement.

Let’s say we want to move data from one file in a filegroup to an empty file. We might do this if we want to free up a significant amount of space after a major data cleanup, but the default SHRINKFILE options would take a long time.

Fair enough, but Kenneth wants to know if it is fully logged.

How do we test this?

Let’s create a database called ShrinkFileTest, and then add a new filegroup called FGData, which will in turn contain a single data file called FGFile1.

[All of the following code was executed on a SQL Server 2014 instance.]

CREATE DATABASE [ShrinkFileTest] ON
PRIMARY (NAME = N'ShrinkFileTest',
FILENAME = N'D:\SQL2014\DI\ShrinkFileTest.mdf',
SIZE = 5120 KB, FILEGROWTH = 1024 KB),
FILEGROUP [FGData] (NAME = N'FGFile1',
FILENAME = N'D:\SQL2014\DI\FGFile1.ndf',
SIZE = 2048 MB, FILEGROWTH = 1024 MB)
LOG ON (NAME = N'ShrinkFileTest_log',
FILENAME = N'D:\SQL2014\DI\ShrinkFileTest_log.ldf',
SIZE = 1024 MB, FILEGROWTH = 1024 MB);
GO

ALTER DATABASE [ShrinkFileTest]
SET RECOVERY FULL;
GO

Later, we will add a new file to FGData and use the shrink command to move the data to the new file.

So now let’s populate the FGData filegroup with some test data. Create a table called [dbo].[Test] on the filegroup, and insert ten million rows.

use [ShrinkFileTest];
GO
CREATE TABLE [dbo].[Test] (
[Col1] INT IDENTITY(1,1) PRIMARY KEY,
[Col2] NCHAR(5) DEFAULT N'aaaaa'
) ON [FGData];
GO

-- Load in ten million rows
INSERT INTO [dbo].[Test]
SELECT TOP (10000000) 'aaaaa'
FROM sys.all_objects AS o1
CROSS JOIN sys.all_objects AS o2
CROSS JOIN sys.all_objects AS o3;
GO

We will now create a scenario where most of the rows are deleted but the data file contains a lot of free space.

DELETE FROM [dbo].[Test] WHERE [Col1] % 1000 <> 0;

We should be left with ten thousand rows in the table, at intervals of 1,000 (this scenario is designed to create massive free-space gaps in the data and ensure each remaining row is on its own data page).

We want to move all the existing data into an empty file, so we can delete the file FGFile1.

Add the second file to the FGData filegroup:

USE [master]
GO
ALTER DATABASE [ShrinkFileTest] ADD FILE (
NAME = N'FGFile2',
FILENAME = N'D:\SQL2014\DI\FGFile2.ndf',
SIZE = 512 MB, FILEGROWTH = 1024 MB
) TO FILEGROUP [FGData];
GO

Before we run the DBCC SHRINKFILE command though, we should flush the transaction log of the millions of DELETEs (which are fully logged), so that it’s easier to read:

use [ShrinkFileTest];
GO
CHECKPOINT;
GO

Run the shrink command with the EMPTYFILE parameter:

DBCC SHRINKFILE (FGFile1, EMPTYFILE);

Here’s our output from the shrink:

Let’s see what is in the log:

SELECT [Current LSN], [Operation], [Context], [Page ID], [Slot ID]
FROM fn_dblog (NULL, NULL);
GO

Below the CHECKPOINT output, you will see two rows in the log file, both of which are modifying the Page Free Space (PFS) page.
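
As a follow-up (not part of the original test), once the file has been emptied it can be removed with the ALTER DATABASE statement that Books Online mentions:

ALTER DATABASE [ShrinkFileTest] REMOVE FILE [FGFile1];
GO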

Summary

Kenneth, it is fully logged. In Paul’s words:

Shrink is always fully logged – it’s impossible not to log physical changes to data files.

I hope this satisfies your curiosity.