Temporal Tables and Hidden Period Columns

In my November 2015 post, An Introduction to Temporal Tables in SQL Server 2016 using a DeLorean, I wrote:

The HIDDEN property is optional and will hide these columns from a standard SELECT statement for backward compatibility with our application and queries. You cannot apply the HIDDEN property to an existing column.

It turns out that this is no longer true. You can apply the HIDDEN property to an existing period column.

Let’s assume you have a temporal table containing two visible period columns, StartDate and EndDate, which you’d like to hide from a typical SELECT statement.
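For context, here’s a minimal sketch of such a table. Only the table name and period column names match the ALTER statements below; the other columns and the history table name are my own illustration.

CREATE TABLE [dbo].[Account] (
[AccountID] INT NOT NULL PRIMARY KEY CLUSTERED,
[AccountName] NVARCHAR(250) NOT NULL,
[StartDate] DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
[EndDate] DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
PERIOD FOR SYSTEM_TIME ([StartDate], [EndDate])
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = [dbo].[AccountHistory]));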

Using an ALTER TABLE ... ALTER COLUMN statement, simply place the ADD HIDDEN syntax after the period column name(s).

ALTER TABLE [dbo].[Account] ALTER COLUMN [StartDate] ADD HIDDEN;
ALTER TABLE [dbo].[Account] ALTER COLUMN [EndDate] ADD HIDDEN;

You can also remove this flag if you wish, using DROP HIDDEN:

ALTER TABLE [dbo].[Account] ALTER COLUMN [StartDate] DROP HIDDEN;
ALTER TABLE [dbo].[Account] ALTER COLUMN [EndDate] DROP HIDDEN;
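If you want to confirm the flag either way, sys.columns exposes an is_hidden column in SQL Server 2016. A quick check against the example table above:

-- 1 = hidden, 0 = visible in SELECT * queries
SELECT [name], [is_hidden]
FROM sys.columns
WHERE [object_id] = OBJECT_ID(N'dbo.Account')
AND [name] IN (N'StartDate', N'EndDate');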

This is a great improvement to an already fantastic feature of SQL Server 2016. Thanks to Borko Novakovic for this tip.

If you have any more temporal table tricks you want to share, find me on Twitter at @bornsql.

My surname is NULL

Last Wednesday on Twitter, Abayomi Obawomiye (@SQLAmerica) wrote:

I just met someone with the last name NULL today. I really want to ask if they had issues with the last name but worried might be offensive

Abayomi went on to say that the word “NULL” gets replaced with a space in any CSV files generated for reports, which got me thinking about this problem.

Then I realised I had already written about it. We make assumptions on a daily basis when it comes to working with data. In this case, Abayomi assumed that no one’s name could ever be “Null”.

Yes, I’ve made that assumption too.

This sparked a memory in the deep and rusty recesses of my brain. I remember reading an article about someone whose last name is Null. It took me just a few seconds to find the article on Wired Magazine’s website, and the pullout phrase is right in the headline:

Hello, I’m Mr. Null. My Name Makes Me Invisible to Computers

It turns out that Abayomi is not alone with this chance meeting. How many of us consider the possibility that legitimate data in our databases could match reserved keywords?

The author of that article, Christopher Null, has an amusing workaround:

My usual trick is to simply add a period to my name: “Null.” This not only gets around many “null” error blocks, it also adds a sense of finality to my birthright.

My goal is to build systems that are easier to use and that adapt to the consumer, so this is clearly problematic. We shouldn’t be forcing Christopher to change his own name to fit our database model.

How would I go about dealing with this problem? Clearly, when exporting data to CSV or generating extracts for other third-party use, this problem has to be solved.

Delimiters are Good

The easiest way I can think of is to delimit the data correctly. When exporting to CSV format, we would ensure that all strings are surrounded by quotation marks. This way, Christopher Null’s record would show up as follows (assuming last name, first name ordering):

"Null","Christopher"

Then the parsing process on the reporting side would import that into whichever system without stumbling over the value.
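As a rough T-SQL illustration (the column names and inline values are hypothetical), QUOTENAME with a double-quote character does the wrapping for us:

-- Wrap string values in double quotes before writing them out to CSV
SELECT QUOTENAME([LastName], '"') + N',' + QUOTENAME([FirstName], '"') AS [CsvLine]
FROM (VALUES (N'Null', N'Christopher')) AS p ([LastName], [FirstName]);
-- Returns: "Null","Christopher"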

Another issue raised by Christopher in his Wired article is that his email address is rejected outright by Bank of America because it contains the word null before the @ sign.

First and Last Name columns are bad

I’m going to call myself out on the above example and say that assuming names can fit neatly into two columns is bad. Firstly, Cher and Madonna would complain.

Secondly, Vice Admiral Horatio Nelson, 1st Viscount Nelson, 1st Duke of Bronté, would run out of space before his name could be entered correctly.

This doesn’t even begin to address names that are outside of our mainly Western perspective. In South Africa, I saw names that few developers in North America would consider, making use of all manner of punctuation (including spaces).

My recommendation here is to have a 250-character NVARCHAR column called FullName and leave it at that. If you really want your mail merge software to send “personalised” emails or letters, add an additional PreferredName column, and make sure it’s properly delimited when exporting.
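Here’s a minimal sketch of what I mean (the table and column names are just placeholders):

CREATE TABLE [dbo].[Person] (
[PersonID] INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
[FullName] NVARCHAR(250) NOT NULL,
[PreferredName] NVARCHAR(250) NULL
);

INSERT INTO [dbo].[Person] ([FullName], [PreferredName])
VALUES (N'Christopher Null', N'Christopher');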

Christopher Null wouldn’t have to bat an eyelid here. It would solve his problem.

(While I’m on this matter, don’t ask for Gender or Sex. You don’t need it. If you do, you’ll know how to ask properly.)

Email Validation is Stupid

My third rule for ensuring that Nulls can exist in this world with Wests and Obawomiyes is to stop validating email addresses!

Firstly, most developers use a regular expression to do it, which is wrong and terrible and slow, and (clearly) imperfect.

This is a particular bugbear of mine, since I have one character before the @ in my personal email address. Many systems fail when trying to validate it, including WestJet. My own workaround is to create a westjet@ alias so that I can receive email from WestJet.

The best way to validate an email address is to send an email to that address with a link for the recipient to click on. If they don’t click the link to close the loop, the email address isn’t valid, and your well-designed system won’t let them continue to the next step until they do.

Summary

Three simple recommendations could help the Nulls find visibility again and ensure data inserted into a system is not modified unnecessarily.

For more reading on this and similar problems, I recommend visiting Bobby Tables.

If you have any experience in this area, please feel free to contact me on Twitter, at @bornsql.

Temporal Tables and History Retention

I’m a huge fan of Temporal Tables in SQL Server 2016. I first wrote about them in a four-part series in November 2015, before SQL Server 2016 was even released. I don’t always get this excited about new features.

However, it has some limitations. As part of this week’s T-SQL Tuesday, hosted by the attractive and humble Brent Ozar, I have discovered a Microsoft Connect item I feel very strongly about.

Adam Machanic, the creator of an indispensable tool, sp_WhoIsActive, has created a Connect item entitled Temporal Tables: Improve History Retention of Dropped Columns.

As my readers know, a temporal history table has to have the same schema as its base table (the number and order of columns, and their respective data types, have to match).

Where this breaks down is when the table structure changes on the base table. The history table also has to take those changes into account, which can mean either losing history (when a column is dropped from both tables) or keeping redundant columns in the base table just to preserve that history.
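For example, dropping a column from a system-versioned table propagates the change to the history table, and the history stored in that column goes with it (the table and column names here are hypothetical):

-- The column is removed from both the base table and the history table,
-- so any history it contained is lost.
ALTER TABLE [dbo].[Account] DROP COLUMN [AccountStatus];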

Adam suggests allowing columns which no longer appear in the base table to be retained in the history table and marked as nullable (or hidden), appearing only when a point-in-time query refers to those columns explicitly.

I have voted for this suggestion, and at the time of writing, it has 16 upvotes. I encourage you to add your voice to this suggestion.

If you have any other suggestions, or wish to discuss temporal tables, please contact me on Twitter at @bornsql.

Wait For Service Pack 1

Conventional wisdom tells us that when Microsoft releases a new version of any server product, we should wait until Service Pack 1 before deploying it to production.

This hasn’t been true for a while now, since Microsoft recommended that Cumulative Updates for SQL Server carry the same confidence:

SQL Server CUs are certified to the same levels as Service Packs, and should be installed at the same level of confidence.

However, Service Pack 1 itself has been mired in some controversy. Microsoft didn’t make things any easier for us with SQL Server 2012 or 2014: both versions had issues with their respective Service Pack 1 releases.

Fortunately, SQL Server 2016 has broken the cycle, and along with all of the fixes in Cumulative Updates 1–3, and a security fix, we get better feature parity between Enterprise Edition and lower editions, including Standard, Web, Express and LocalDB.

There are some restrictions, of course, but the idea is that developers and vendors can write T-SQL for features that now appear across the board.

SQL Server 2016 Service Pack 1 makes the following features, previously exclusive to Enterprise Edition, available in Standard, Web, Express and LocalDB as well:

  • In-Memory OLTP
  • In-Memory Columnstore
  • Always Encrypted
  • Partitioning
  • Data Compression
  • Change Data Capture
  • And more!

If you want to take advantage of these features but you’re on an older version of SQL Server, you will need to upgrade to SQL Server 2016 with Service Pack 1. I think this is a no-brainer.
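If you’re not sure what you’re currently running, a quick way to check the edition and patch level is with SERVERPROPERTY:

SELECT SERVERPROPERTY('Edition') AS [Edition],
SERVERPROPERTY('ProductLevel') AS [ProductLevel], -- e.g. RTM or SP1
SERVERPROPERTY('ProductVersion') AS [ProductVersion];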

The good news is that licence prices haven’t changed since SQL Server 2012, and Standard Edition costs almost a quarter of the price of Enterprise Edition.

I maintain that this is the most exciting release of SQL Server since 2005. If you want to upgrade, contact us and we’ll help. We will even upgrade you from SQL Server 6.5.

Temporal Tables in Azure SQL Database

The latest Microsoft Azure newsletter, which I received last week, contained this most excellent news:

Azure SQL Database Temporal Tables generally available

Temporal Tables let customers track the full history of data changes in Azure SQL Database without custom coding. Customers can focus data analysis on a specific point in time and use a declarative cleanup policy to control retention of historical data. Designed to improve productivity when customers develop applications, Temporal Tables can help:

– Support data auditing in applications.
– Analyze trends or detect anomalies over time.
– Implement slowly changing dimension patterns.
– Perform fine-grained row repairs in cases of accidental data errors made by humans or applications.

For more information on how to integrate Temporal Tables in an application, please visit the Getting Started with Temporal Tables in Azure SQL Database documentation webpage. To use temporal retention, please visit the Manage temporal history with retention policy documentation webpage.

Long-time readers of my blog will remember my short series about Temporal Tables in SQL Server 2016. Now it’s time to play with them on Azure SQL Database too!
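Here’s a minimal sketch of a temporal table with the declarative retention policy mentioned above. The table name, columns, and the six-month retention period are all my own assumptions:

CREATE TABLE [dbo].[Account] (
[AccountID] INT NOT NULL PRIMARY KEY CLUSTERED,
[AccountName] NVARCHAR(250) NOT NULL,
[ValidFrom] DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
[ValidTo] DATETIME2 GENERATED ALWAYS AS ROW END NOT NULL,
PERIOD FOR SYSTEM_TIME ([ValidFrom], [ValidTo])
)
WITH (SYSTEM_VERSIONING = ON (
HISTORY_TABLE = [dbo].[AccountHistory],
HISTORY_RETENTION_PERIOD = 6 MONTHS
));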

Changes to Table Schema

With Azure SQL Database, just like SQL Server 2016, you can change the table schema without breaking the link to the history table. The Getting Started page linked above states that you simply run standard ALTER TABLE statements, “and Azure SQL Database will appropriately propagate changes to the history table”. It’s good to see feature parity across products like this. Temporal tables even work on Basic-sized databases.
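For example, adding a (hypothetical) column to the sketch table above also adds it to the history table automatically:

ALTER TABLE [dbo].[Account] ADD [ModifiedBy] NVARCHAR(128) NULL;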

Go forth and play with Temporal Tables. You’ll no longer audit data changes the same way. Get the benefits of Change Data Capture without the need for massive complexity.

Share your CDC horror stories with me on Twitter, at @bornsql.

On clustered indexes, clustering keys and primary keys

Many smart people have spoken about clustering keys and primary keys before, so here’s a less smart take on the topic.

Let’s define some things first:

Clustered Index – the column (or columns) by which the data in the table will be logically sorted. In other words, barring some exceptions, if you look at the data pages on disk, they will be in the order of the clustered index. A clustered index is the data itself. Can also be called a Clustering Key.

Heap – a table that is unsorted, i.e., it does not contain a clustered index.

Non-Clustered Index – contains a copy of a column (or columns) and uses the clustered index to map back to the data in the table.

A non-clustered index will always contain a copy of the clustered index as well (which is why a clustered index should be narrow).

If a table does not contain a clustered index (a heap), a non-clustered index will instead contain a copy of the Row Identifier (known as the RID), which is an under-the-covers way for SQL Server to uniquely identify a row.

Primary Key – a unique column (or columns), most often used in table joins, to uniquely identify individual rows. A primary key does not have to be a clustering key (but if you use the table designer in Management Studio, it becomes a clustering key by default — this is where a lot of confusion stems from).

Identity Key – a column that increments automatically when a new row is inserted. Can be part of a primary key and a clustered index.

Does my clustered index have to be unique?

Clustered and Non-Clustered Indexes do not have to be unique, but it helps. If a clustered index is not unique, an extra 4-byte “uniquifier” is appended to duplicate key values under the covers so that SQL Server can tell the rows apart. Simply speaking, if you’re going to put a clustered index on a table, do it on a unique column (or combination of columns).

Does my table need a clustered index or primary key?

Unless you’re dealing with staging data (and even then I’d argue for a clustered index), all tables should have a primary key and a clustered index.

How do I choose a primary key?

I cannot answer this question for you. There are two main schools of thought:

  • Natural key – the data defines the key. In other words, you have a column in your table that is guaranteed to be unique. Social Security Numbers are not guaranteed to be unique. Neither are telephone numbers, nor MAC addresses. Don’t use these.
  • Surrogate key – you add a column up front, usually an integer (or a BIGINT if there will be more than 2 billion rows in the table), which is automatically incremented (identity key).

My preference is always for a surrogate key, but this is my decision, and there are knock-on effects that I need to consider as a result of making this choice.

Choosing an appropriate primary key is an exercise for the reader. You can start by reading Grant Fritchey’s recent post entitled “The Clustered Index Is Vital To Your Database Design”.

What does Microsoft say?

Best practice dictates that a table should have a clustered index on it, preferably unique and narrow, with an ever-increasing value.

This type of clustered index helps with reads, updates, and deletes, which are usually the activities a table will see. Insert-heavy tables may take a performance hit with a clustered index, because each new row has to be written into the correct logical position, but on modern storage subsystems this overhead is very low.

If you decide to create a clustered index with more than one column, keep in mind that every non-clustered index will contain a copy of the clustered index, so you want to keep this as narrow as possible.

So why does my primary key have a clustered index on it?

For a number of reasons, Microsoft decided that when you create a primary key through the table designer in Management Studio (and in tools even before SQL Server 2005), it automatically makes that primary key a clustered index.

I’m not saying you should blame Microsoft, because in the vast majority of cases this is perfectly acceptable. But it does cause problems with the wrong kind of data type.

Let’s say that you use a randomly-generated value for a primary key. The usual example is a GUID, which is 16 bytes wide and random (compared to the narrow 4-byte integer).

If you create a table in the table designer, and make this GUID column a primary key, SQL Server will automatically make that column a clustered index in the background.

For a clustered index, this is the worst possible data type, so we should choose another column to be the clustered index. We can do that in the designer, but it’s not as simple as clicking the yellow key icon.

This is why more experienced SQL Server DBAs and developers tend to script out tables, adding the clustered index to the script manually, on a different column.

In other words, you can have a primary key without a clustered index (which is the same thing as having a unique non-clustered index on a table), and have the clustered index on a more appropriate column for storing the data.
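Here’s a minimal sketch of that pattern (the table and column names are only illustrative): the GUID remains the primary key, enforced by a unique non-clustered index, while a narrow, ever-increasing integer carries the clustered index.

CREATE TABLE [dbo].[Customer] (
[CustomerGuid] UNIQUEIDENTIFIER NOT NULL
CONSTRAINT [PK_Customer] PRIMARY KEY NONCLUSTERED,
[CustomerKey] INT IDENTITY(1,1) NOT NULL,
[FullName] NVARCHAR(250) NOT NULL
);

-- The clustered index lives on the narrow, ever-increasing column instead
CREATE UNIQUE CLUSTERED INDEX [CX_Customer_CustomerKey]
ON [dbo].[Customer] ([CustomerKey]);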

If you have any questions or comments, please chat to me on Twitter at @bornsql.

What is physical storage anyway?

My friend Jason asked me about his MySQL database export file last week, whether it was sufficient to create a new database. I replied saying that he would have to set up the database separately, because the physical files are allocated in a particular way depending on the web host.

What is a physical file, though? Kenneth Fisher, that crazy cat from SQL Studies, reminded me on Twitter that a physical file, or table, or database, isn’t actually physical. Hard drives are mostly virtualised, especially with web servers, so nothing is physical anymore. Even more confusingly, hard drives are gradually being replaced by solid-state memory technology that does something called wear-levelling.

When we speak of physical files, or databases, or tables, we’re referring to the way these files are allocated on a storage device. In the old days, pre-2000, it was possible on some operating systems and file systems to allocate the fastest part of the spinning platter of a hard drive (the outer tracks) to processes that needed low seek times and high throughput, like Relational Database Management Systems (RDBMS).

Nowadays, with virtualised storage, including network storage and cloud storage, the physical aspect is meaningless. Spindles are what matter. But what’s a spindle? Until recently, storage was provided by spinning magnetic platters called hard drives. The spindle in the centre is what we refer to, and the more drives allocated to a storage device, the better.

The problem with spinning drives is that they have a built-in delay for the read-head to locate the correct area on the platter surface, which is referred to as latency. Adding more spindles (more drives) to an array lowers the latency, because files are striped across multiple disks (RAID).

For example, in a three-disk configuration, one file could effectively be split into three parts. That improves read times, and therefore latency, because the file system can look for different parts of the file on different underlying platters, effectively cutting the read time to roughly a third.

To complicate things even more, solid-state storage, which comes in many forms, most commonly USB thumb drives, camera cards (often called memory cards, which is a misnomer), and solid-state drives (SSDs), is gradually replacing hard drives in devices, network storage and cloud storage. SSDs have almost no latency, because there’s no physical platter to spin up and no read-head to position over a particular spot on the surface. The cells that contain the bits of data are accessible near-instantly.

However, this speed improvement has a trade-off: solid-state storage has a limited lifespan. The cells can only be written to a certain number of times before failing, so the drive controller has to move data around all the time, a process called wear-levelling, to ensure that the blocks of storage are used evenly, thereby extending the life of the drive. Where a file is physically located on the drive can change at any given moment.

Virtual storage in cloud and network storage devices already muddied the waters with sharing data between virtual machines. Solid-state drives make it even more abstract.

A physical database might refer to the server it’s running on or a virtual machine that is running on a physical server. A physical file might refer to a collection of bits on a single spinning disk drive or SSD, or a collection of bits that are located on several hundred spinning disks.

We have reached a point where a physical storage paradigm is meaningless. When we talk about the physical file on a disk, we mean the abstract representation of the file system’s knowledge of where that file is stored at that point in time. Ultimately, though, it doesn’t mean what it used to.

If you would like to commiserate over the loss of high-failure spinning rust, find me on Twitter at @bornsql.

Is DBCC SHRINKFILE (filename, EMPTYFILE) fully logged?

Update: Paul Randal told me I had made a mistake in my post. The correction is below.


On Wednesday last week, Kenneth Fisher (b | t) asked: “Is DBCC SHRINKFILE (filename, EMPTYFILE) fully logged?”

Let’s buckle down and see for ourselves, shall we?

Firstly, what does DBCC SHRINKFILE do?

As we know, SQL Server allows us to reclaim space in data and log files by issuing a DBCC SHRINKFILE command. Books Online goes into some detail about each of the parameters we can use but does not seem to answer Kenneth’s question.

What does the EMPTYFILE parameter do, anyway?

Per Books Online:

EMPTYFILE

Migrates all data from the specified file to other files in the same filegroup. Because the Database Engine no longer allows data to be placed in the empty file, the file can be removed by using the ALTER DATABASE statement.

Let’s say we want to move data from one file in a filegroup to an empty file. We might do this if we want to free up a significant amount of space after a major data cleanup, but the default SHRINKFILE options would take a long time.

Fair enough, but Kenneth wants to know if it is fully logged.

How do we test this?

Let’s create a database called ShrinkFileTest, and then add a new filegroup called FGData, which will in turn contain a single data file called FGFile1.

[All of the following code was executed on a SQL Server 2014 instance.]

CREATE DATABASE [ShrinkFileTest] ON
PRIMARY (NAME = N'ShrinkFileTest',
FILENAME = N'D:\SQL2014\DI\ShrinkFileTest.mdf',
SIZE = 5120 KB, FILEGROWTH = 1024 KB),
FILEGROUP [FGData] (NAME = N'FGFile1',
FILENAME = N'D:\SQL2014\DI\FGFile1.ndf',
SIZE = 2048 MB, FILEGROWTH = 1024 MB)
LOG ON (NAME = N'ShrinkFileTest_log',
FILENAME = N'D:\SQL2014\DI\ShrinkFileTest_log.ldf',
SIZE = 1024 MB, FILEGROWTH = 1024 MB);
GO

ALTER DATABASE [ShrinkFileTest]
SET RECOVERY FULL;
GO

Later, we will add a new file to FGData and use the shrink command to move the data to the new file.

So now let’s populate the FGData filegroup with some test data. Create a table called [dbo].[Test] on the filegroup, and insert ten million rows.

use [ShrinkFileTest];
GO
CREATE TABLE [dbo].[Test] (
[Col1] INT IDENTITY(1,1) PRIMARY KEY,
[Col2] NCHAR(5) DEFAULT N'aaaaa'
) ON [FGData];
GO

-- Load in ten million rows
INSERT INTO [dbo].[Test] ([Col2])
SELECT TOP (10000000) N'aaaaa'
FROM sys.all_objects AS o1
CROSS JOIN sys.all_objects AS o2
CROSS JOIN sys.all_objects AS o3;
GO

We will now create a scenario where most of the rows are deleted but the data file contains a lot of free space.

DELETE FROM [dbo].[Test] WHERE [Col1] % 1000 <> 0;

We should be left with ten thousand rows in the table, at intervals of 1,000 (this scenario is designed to create massive free-space gaps in the data and to ensure each remaining row sits on its own data page).

We want to move all the existing data into an empty file, so we can delete the file FGFile1.

Add the second file to the FGData filegroup:

USE [master]
GO
ALTER DATABASE [ShrinkFileTest] ADD FILE (
NAME = N'FGFile2',
FILENAME = N'D:\SQL2014\DI\FGFile2.ndf',
SIZE = 512 MB, FILEGROWTH = 1024 MB
) TO FILEGROUP [FGData];
GO

Before we run the DBCC SHRINKFILE command though, we should flush the transaction log of the millions of DELETEs (which are fully logged), so that it’s easier to read. Because the database has never been backed up, it is still effectively in auto-truncate mode, so a CHECKPOINT clears the inactive portion of the log:

use [ShrinkFileTest];
GO
CHECKPOINT;
GO

Run the shrink command with the EMPTYFILE parameter:

DBCC SHRINKFILE (FGFile1, EMPTYFILE);

Here’s our output from the shrink:

Let’s see what is in the log:

SELECT [Current LSN], [Operation], [Context], [Page ID], [Slot ID]
FROM fn_dblog (NULL, NULL);
GO

Below the CHECKPOINT output, you will see two rows in the log file, both of which are modifying the Page Free Space (PFS) page.
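To finish the scenario, now that the shrink has emptied FGFile1, the Books Online excerpt above tells us we can remove the file:

ALTER DATABASE [ShrinkFileTest] REMOVE FILE [FGFile1];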

Summary

Kenneth, it is fully logged. In Paul’s words:

Shrink is always fully logged – it’s impossible not to log physical changes to data files.

I hope this satisfies your curiosity.

Data Efficiency in SQL Server: DECIMAL

This is part two of a short series of posts about how I assign efficient data types when designing a new table or database.

Use less space with BIGINT

Last week, I spoke about a more efficient DATETIME than DATETIME. This week I have a more efficient DECIMAL than DECIMAL.

If you’re planning to store a long number that will never have a fraction, and you were considering using DECIMAL, I’m going to ask you to think about BIGINT first.

Granted, this is a very specific use-case, and requires some forethought, but that’s our job: to think ahead.

Here’s the scenario where I opted for a BIGINT over a DECIMAL: IMEI numbers.

According to Wikipedia, the International Mobile Station Equipment Identity (that long number on your mobile device that identifies it uniquely—well, mostly) is usually from 14 to 16 digits in length, and may be as short as 8 digits in length.

A DECIMAL data definition requires two values: precision and scale.

Let’s say you want to store the value 123,456.789. The number of digits in total (both to the left and right of the decimal point) is nine. Thus, the precision is 9.

The scale is everything to the right of the decimal point, which in this case is 3.

In other words, we would define this value as DECIMAL(9,3).

A precision of 1–9 digits requires only 5 bytes of storage, which makes DECIMAL more efficient than a BIGINT in that range: the BIGINT needs the full 8 bytes, and it cannot store a fraction anyway.

However, as the precision increases, so do the storage requirements. A precision of 10–19 digits requires 9 bytes (an 80% jump in storage required), and this is the sweet spot where a BIGINT becomes an effective alternative.

If you do not need to store a fraction, as in the case of an IMEI number, consider BIGINT. 8 bytes will safely store a value up to 19 digits in length (for a maximum value of 2⁶³-1).
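A quick way to see the difference for yourself, using a made-up IMEI-like value:

DECLARE @AsDecimal DECIMAL(19,0) = 356938035643809;
DECLARE @AsBigint BIGINT = 356938035643809;

SELECT DATALENGTH(@AsDecimal) AS [DecimalBytes], -- 9 bytes
DATALENGTH(@AsBigint) AS [BigintBytes]; -- 8 bytes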

So if you’re storing IMEI numbers for a country’s worth of mobile devices, say 250 million rows, that one byte per row will save you roughly 238 MB in memory, queries, storage, indexing, backups, and so on.

I think that’s worth thinking about.

Hit me up on Twitter, at @bornsql, to tell me how you’ve made efficient data design choices.

Data Efficiency in SQL Server: DATETIME

This is the first in a short series of posts about how I assign efficient data types when designing a new table or database.

Use less space with DATETIME2

We all know that the DATETIME data type uses 8 bytes of space to store the date and time, to an accuracy of a paltry 3.33 milliseconds.

This used to cause no end of drama in the days before SQL Server 2008 because milliseconds were rounded to end in 0, 3 or 7. When trying to calculate the last possible time in a day, you had to work with some crazy values like 2016-03-15 23:59:59.997.

Fortunately, SQL Server 2008 introduced DATETIME2, which put paid to that drama. We now have a precision of up to 7 places after the decimal, and it still only uses a maximum of 8 bytes!

For example, we can now store a value of 2015-03-15 23:59:59.9999999 which is mind-bogglingly close to midnight, and not worry about insane values ending in a 7.

Do you really need that precision, though? How about a way to actually use less storage and store more accurate data?

If 0–2 places of precision (after the decimal point) require only 6 bytes, 3–4 places require 7 bytes, and 5–7 places require the full 8 bytes, then we can save ourselves a whole byte, and not lose any precision, by using DATETIME2(3) as a drop-in replacement for DATETIME.

We get precision down to the millisecond (even more accurate than before) and shave off one byte for every row in that column. In a table with ten million rows, that’s 9.5MB of space we no longer have to store, or query, or index, or back up.
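Here’s a minimal sketch of the swap (the table is hypothetical), along with a quick check of the storage size of each value:

CREATE TABLE [dbo].[AuditEvent] (
[AuditEventID] BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
[EventTime] DATETIME2(3) NOT NULL
CONSTRAINT [DF_AuditEvent_EventTime] DEFAULT (SYSDATETIME())
);

SELECT DATALENGTH(CAST(SYSDATETIME() AS DATETIME2(3))) AS [DateTime2Bytes], -- 7 bytes
DATALENGTH(CAST(SYSDATETIME() AS DATETIME)) AS [DateTimeBytes]; -- 8 bytes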

What data efficiency method have you used in production? Let me know on Twitter at @bornsql.