On clustered indexes, clustering keys and primary keys

Many smart people have spoken about clustering keys and primary keys before, so here’s a less smart take on the topic.

Let’s define some things first:

Clustered Index – the column (or columns) by which the data in the table will be logically sorted. In other words, barring some exceptions, if you look at the data pages on disk, they will be in the order of the clustered index. A clustered index is the data itself. Can also be called a Clustering Key.

Heap – a table that is unsorted, i.e., it does not contain a clustered index.

Non-Clustered Index – contains a copy of a column (or columns) and uses the clustered index to map back to the data rows in the table.

A non-clustered index will always contain a copy of the clustered index as well (which is why a clustered index should be narrow).

If a table does not have a clustered index (in other words, it is a heap), a non-clustered index will instead contain a copy of the Row Identifier (known as the RID), which is an under-the-covers way for SQL Server to uniquely identify a row.

Primary Key – a unique column (or columns), most often used in table joins, to uniquely identify individual rows. A primary key does not have to be a clustering key (but if you use the table designer in Management Studio, it becomes a clustering key by default — this is where a lot of confusion stems from).

Identity Key – a column that increments automatically when a new row is inserted. Can be part of a primary key and a clustered index.
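
To tie these definitions together, here is a minimal sketch (the table, column, and index names are hypothetical): a surrogate identity column serves as both the primary key and the clustering key, and a non-clustered index quietly carries a copy of that clustering key.

-- Hypothetical table: an identity column acting as primary key and clustered index
CREATE TABLE dbo.Customer (
    [CustomerID] INT IDENTITY(1, 1) NOT NULL,
    [Email] NVARCHAR(256) NOT NULL,
    [FullName] NVARCHAR(200) NOT NULL,
    CONSTRAINT [PK_Customer] PRIMARY KEY CLUSTERED ([CustomerID])
)

-- A non-clustered index on Email; under the covers it also stores the
-- clustering key (CustomerID) so it can map back to the data rows
CREATE NONCLUSTERED INDEX [IX_Customer_Email] ON dbo.Customer ([Email])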

Does my clustered index have to be unique?

Clustered and Non-Clustered Indexes do not have to be unique, but it helps. If a clustered index is not unique, a 4-byte “uniquifier” is appended to duplicate key values under the covers so that SQL Server can still identify each row uniquely. Simply speaking, if you’re going to put a clustered index on a table, do it on a unique column (or combination of columns).
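
Here is a hedged sketch using a hypothetical Orders table. The two statements are alternatives, not a sequence (a table can only have one clustered index): the first forces SQL Server to add uniquifiers to duplicate OrderDate values, the second does not.

-- Non-unique clustered index: SQL Server silently appends a 4-byte uniquifier
-- to any rows that share the same OrderDate value
CREATE CLUSTERED INDEX [CIX_Orders_OrderDate] ON dbo.Orders ([OrderDate])

-- Unique clustered index: no uniquifier required
CREATE UNIQUE CLUSTERED INDEX [CIX_Orders_OrderID] ON dbo.Orders ([OrderID])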

Does my table need a clustered index or primary key?

Unless you’re dealing with staging data (and even then I’d argue for a clustered index), all tables should have a primary key and a clustered index.

How do I choose a primary key?

I cannot answer this question for you. There are two main schools of thought:

  • Natural key – the data defines the key. In other words, you have a column in your table that is guaranteed to be unique. Social Security Numbers are not guaranteed to be unique. Neither are telephone numbers, nor MAC addresses. Don’t use these.
  • Surrogate key – you add a column up front, usually an integer (or a BIGINT if there will be more than 2 billion rows in the table), which is automatically incremented (an identity key); see the sketch after this list.
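
Here is a minimal sketch of the surrogate key approach (the table and column names are hypothetical): the identity column handles the incrementing, and the natural business value can still be kept unique with a separate constraint.

-- Hypothetical surrogate key: an automatically incrementing integer
-- (use BIGINT instead of INT if more than 2 billion rows are expected)
CREATE TABLE dbo.Invoice (
    [InvoiceID] INT IDENTITY(1, 1) NOT NULL,
    [InvoiceNumber] NVARCHAR(50) NOT NULL,
    [InvoiceDate] DATE NOT NULL,
    CONSTRAINT [PK_Invoice] PRIMARY KEY ([InvoiceID]),
    CONSTRAINT [UQ_Invoice_InvoiceNumber] UNIQUE ([InvoiceNumber])
)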

My preference is always for a surrogate key, but this is my decision, and there are knock-on effects that I need to consider as a result of making this choice.

Choosing an appropriate primary key is an exercise for the reader. You can start by reading Grant Fritchey’s recent post entitled “The Clustered Index Is Vital To Your Database Design”.

What does Microsoft say?

Best practice dictates that a table should have a clustered index on it, preferably unique and narrow, with an ever-increasing value.

This type of clustered index helps with reads, updates, and deletes, which are usually the activities a table will see. Insert-heavy tables may take a performance hit from a clustered index, because each new row has to be placed in its correct position, but on modern storage subsystems this is very low overhead.

If you decide to create a clustered index with more than one column, keep in mind that every non-clustered index will contain a copy of the clustered index, so you want to keep this as narrow as possible.
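
If you want to see this effect on an existing database, here is a sketch (assuming you have VIEW DATABASE STATE permission) that lists the average leaf-level record size of each non-clustered index; a wide clustering key shows up as wider non-clustered index records.

-- Sketch: average leaf-level record size of every non-clustered index
-- in the current database (wide clustering keys inflate these numbers)
SELECT OBJECT_NAME(ips.[object_id]) AS [TableName],
    i.[name] AS [IndexName],
    ips.avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'DETAILED') AS ips
INNER JOIN sys.indexes AS i
    ON i.[object_id] = ips.[object_id]
    AND i.index_id = ips.index_id
WHERE i.[type_desc] = 'NONCLUSTERED'
    AND ips.index_level = 0 -- leaf level only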

So why does my primary key have a clustered index on it?

For a number of reasons, Microsoft decided that when you create a primary key through the table designer in Management Studio (and in tools even before SQL Server 2005), it automatically makes that primary key a clustered index.

I’m not saying you should blame Microsoft, because in the vast majority of cases this is perfectly acceptable. But it does cause problems when the primary key uses the wrong data type.

Let’s say that you use a randomly-generated value for a primary key. The usual example is a GUID, which is 16 bytes wide and random (compared to the narrow 4-byte integer).

If you create a table in the table designer, and make this GUID column a primary key, SQL Server will automatically make that column a clustered index in the background.

For a clustered index, this is the worst possible data type, so we should choose another column to be the clustered index. We can do that in the designer, but it’s not as simple as clicking the yellow key icon.

This is why more experienced SQL Server DBAs and developers tend to script out tables, adding the clustered index to the script manually, on a different column.

In other words, you can have a primary key without a clustered index (which is the same thing as having a unique non-clustered index on a table), and have the clustered index on a more appropriate column for storing the data.
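
Scripted out, that approach looks something like this sketch (the table and column names are hypothetical): the GUID stays the primary key, enforced by a unique non-clustered index, while a narrow, ever-increasing identity column gets the clustered index.

-- Hypothetical example: the primary key is NOT the clustered index
CREATE TABLE dbo.Session (
    [SessionGUID] UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
    [SessionStart] DATETIME2(7) NOT NULL,
    [SessionID] INT IDENTITY(1, 1) NOT NULL,
    CONSTRAINT [PK_Session] PRIMARY KEY NONCLUSTERED ([SessionGUID])
)

-- The clustered index goes on the narrow, ever-increasing identity column
CREATE UNIQUE CLUSTERED INDEX [CIX_Session_SessionID] ON dbo.Session ([SessionID])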

If you have any questions or comments, please chat to me on Twitter at @bornsql.

Automation is the new Black

I’ve been toying with the idea of automating a set of diagnostic scripts that I run on customer sites when doing my check-up. It mirrors the automation that a lot of consulting shops do, but when I run these scripts, I like to spend more time with the customer, explaining what each script does and why it forms only a small part of a big picture.

That way, when I present my findings, a lot of the key points have already been covered, and it’s extremely effective (and comforting) to see a customer understand instantly what I’m referring to and ask questions relating directly to that issue.

Although I call that “face time”, it’s more accurately described as “side time”, because the customer is sitting beside me, watching me run scripts and dump the results into Excel while I talk a lot and speak with my hands. The numbers start to blur, and the customer stops caring about what is obviously a very important problem, even though I’m being paid to figure it out.

It does get a bit overwhelming for the customer, though, especially if they aren’t technically inclined. This is obviously not an efficient use of our time.

So I’m changing things up a little from my side. I’ve figured out how to automate the diagnostics in such a way that I can let them run for about 15 minutes in total (automation means I don’t have to run them in sequence; I can run them all at the same time with a random delay), and once they’re done, produce what is effectively a highlight reel.

Then I can use the rest of the hour to go through these results and explain, using graphs if necessary, the most interesting things I can see. Normally the customer has to wait as long as a day until I produce a 16- to 20-page document, but this way they can see things almost instantly, and if there’s something important to discuss, we can do it immediately.

My consulting style is very conversational (much like this blog), because I want my customer to understand, if not in the same detail as I do (though a significant percentage do), at least what to look for when troubleshooting performance problems and identifying issues with maintenance and disaster recovery planning.

For competitive reasons I can’t disclose my full methodology (I still have to eat!), but I thought I would share some consulting mindshare for a change.

A dalliance with distance

During a recent engagement, I was tasked with making a query faster. Since I signed an NDA (Non-Disclosure Agreement), I can’t go into detail, but it made extensive use of the STDistance() function of the GEOGRAPHY data type.

The query was being called many times a second and, owing to the way the application was designed, was effectively running in a cursor. For each row in a table that contains GPS coordinates, it had to calculate how far the principal GPS coordinates were from that row’s latitude and longitude.

As you can imagine, even with only a few hundred rows, this does not scale well. The database had hundreds of thousands of rows whose distances needed to be recalculated each time the query ran, because these values are constantly changing.

The data was then being ordered on the STDistance result so that the nearest neighbour showed up first.
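
I can’t show the real query, but the general shape of that pattern looks something like this sketch, assuming a hypothetical table with a GEOGRAPHY column I’ll call [Position], and made-up source coordinates:

-- Hypothetical shape of the original pattern: order every row by its
-- distance (in metres) from a single source point
DECLARE @Source geography = geography::Point(34.09833, -118.32583, 4326)

SELECT [ColPK],
    [Position].STDistance(@Source) AS [DistanceInMetres]
FROM [CoordinatesTable]
ORDER BY [Position].STDistance(@Source)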

After some investigation, I discovered that the spatial index on the coordinates column was not set up correctly. However, we were unable to schedule a maintenance window to rebuild that index. Therefore, I was tasked to make the query faster without modifying any indexes.

To begin with, removing the ORDER BY clause made the query run much more quickly, even though it was running in a loop. After removing the looping structure and refactoring the query to be set based, it ran even more quickly. No surprises there, but the result set was now much larger.

However, it was still causing blocking on the table where coordinates were being stored, and the returning data was unsorted.

Enter the great-circle distance: the shortest distance between two points on the surface of a sphere can be calculated using a mathematical formula, the best known being the haversine formula. If there’s one thing I know about computers, it’s that they perform mathematical formulas really well.

The haversine formula is not simple to express in a T-SQL query. Instead, we can use the spherical law of cosines, which is not as accurate (there is a maximum drift of around 2 metres, or a 0.3% margin of error).

(cos c = cos a cos b + sin a sin b cos C)

The advantage of this cosine law is that it’s one line of code, which means it can be used as a drop-in replacement for STDistance.

The customer agreed that for the purpose of this query, 0.3% was well within an acceptable error range. I helped the lead developer rewrite the query to replace the STDistance() call with the cosine formula, which improved performance by a factor of 1,000 and did not need the spatial index at all.

In purely T-SQL terms, this looks as follows ([Latitude] and [Longitude] are columns in the table I’m querying):

-- Latitude of source
DECLARE @Latitude FLOAT(53) = 34.09833
-- Longitude of source
DECLARE @Longitude FLOAT(53) = -118.32583
-- Mean radius of the earth, in miles
DECLARE @Radius FLOAT(53) = 3959

SELECT [ColPK], [Latitude], [Longitude],
ACOS(COS(RADIANS(90 - @Latitude)) * COS(RADIANS(90 - [Latitude]))
+ SIN(RADIANS(90 - @Latitude)) * SIN(RADIANS(90 - [Latitude]))
* COS(RADIANS(@Longitude - [Longitude]))) * @Radius AS [Distance]
FROM [InterestingTable]

(Source: http://gis.stackexchange.com/a/40929)

Now of course this can be sorted and filtered as required. Because the distance is expressed in miles, we can filter the results to display only values that fall within a specific radius of the source, making the query run even faster.
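
For example, reusing the variables declared above, a derived table lets us sort on the calculated distance and keep only the rows within a given radius (the 10-mile cutoff is just an illustration):

-- Sketch: only rows within 10 miles of the source, nearest first
SELECT [ColPK], [Latitude], [Longitude], [Distance]
FROM (
    SELECT [ColPK], [Latitude], [Longitude],
    ACOS(COS(RADIANS(90 - @Latitude)) * COS(RADIANS(90 - [Latitude]))
    + SIN(RADIANS(90 - @Latitude)) * SIN(RADIANS(90 - [Latitude]))
    * COS(RADIANS(@Longitude - [Longitude]))) * @Radius AS [Distance]
    FROM [InterestingTable]
) AS [Results]
WHERE [Distance] <= 10
ORDER BY [Distance]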

If you have any neat geography data things to share, find me on Twitter at @bornsql.

The Data Migration Assistant

I’ve written quite a lot about Azure SQL Database recently, but that doesn’t mean I’ve forgotten about the on-premises version of SQL Server.

What could be better than Microsoft announcing a new tool for upgrading to SQL Server 2016? The Data Migration Assistant (DMA) for SQL Server was announced on 26 August 2016.

Data Migration Assistant (DMA) delivers scenarios that reduce the effort to upgrade to a modern data platform by detecting compatibility issues that can impact database functionality after an upgrade. It recommends performance and reliability improvements for your target environment.

Why is this a big deal? Well, as Microsoft themselves put it:

DMA replaces SQL Server Upgrade Advisor.

The Data Migration Assistant not only helps us make sure that our database instance can be upgraded to the latest version of SQL Server, but also helps us make the best use of new features.

My friend Kenneth Fisher has a post titled The new Data Migration Assistant has been released!, which I encourage you to check out.

We don’t have to run DMA on the SQL Server machine directly. We can install and run it from a workstation and point it at our server. In fact, this is a recommended best practice.

Kenneth and I also strongly expect that later versions of DMA will support Azure SQL Database migrations, which is also exciting (and though I don’t like to read too much into the names Microsoft gives its products, it’s pretty evident from the name that this is where we’re headed).

To share your horror stories about SQL Server Upgrade Advisor, find me on Twitter at @bornsql.