On clustered indexes, clustering keys and primary keys

Many smart people have spoken about clustering keys and primary keys before, so here’s a less smart take on the topic.

Let’s define some things first:

Clustered Index – the column (or columns) by which the data in the table will be logically sorted. In other words, barring some exceptions, if you look at the data pages on disk, they will be in the order of the clustered index. A clustered index is the data itself. Can also be called a Clustering Key.

Heap – a table that is unsorted, i.e., it does not contain a clustered index.

Non-Clustered Index – contains a copy of a column (or columns), that make use of the clustered index to map back to the data in a table.

A non-clustered index will always contain a copy of the clustered index as well (which is why a clustered index should be narrow).

If a table does not contain a clustered index (a heap), a non-unique non-clustered index will contain a copy of a Row Identifier (known as the RID), which is an under-the-covers way for SQL Server to uniquely identify a row.

Primary Key – a unique column (or columns), most often used in table joins, to uniquely identify individual rows. A primary key does not have to be a clustering key (but if you use the table designer in Management Studio, it becomes a clustering key by default — this is where a lot of confusion stems from).

Identity Key – a column that increments automatically when a new row is inserted. Can be part of a primary key and a clustered index.

Does my clustered index have to be unique?

Clustered and Non-Clustered Indexes do not have to be unique, but it helps. If a clustered index is not unique, an extra 4-byte “uniquifier” is appended under the covers so that SQL Server knows which row it refers to. Simply speaking, if you’re going to put a clustered index on a table, do it on a unique column (or combination of columns).

Does my table need a clustered index or primary key?

Unless you’re dealing with staging data (and even then I’d argue for a clustered index), all tables should have a primary key and a clustered index.

How do I choose a primary key?

I cannot answer this question for you. There are two main schools of thought:

Natural key – the data defines the key. In other words, you have a column that is guaranteed to be unique, in your table. Social Security Numbers are not guaranteed to be unique. Neither are telephone numbers, nor MAC addresses. Don’t use these.

Surrogate key – you add a column up front, usually an integer (or a BIGINT if there will be more than 2 billion rows in the table), which is automatically incremented (identity key).

My preference is always for a surrogate key, but this is my decision, and there are knock-on effects that I need to consider as a result of making this choice.

Choosing an appropriate primary key is an exercise for the reader. You can start by reading Grant Fritchey’s recent post entitled “The Clustered Index Is Vital To Your Database Design”.

What does Microsoft say?

Best practice dictates that a table should have a clustered index on it, preferably unique and narrow, with an ever-increasing value.

This type of clustered index helps with reads, updates, and deletes, which are usually the activities a table will see. Insert-heavy tables may have a performance hit on a clustered index because it has to physically sort the data in the right place, but on modern storage subsystems, this is very low overhead.

If you decide to create a clustered index with more than one column, keep in mind that every non-clustered index will contain a copy of the clustered index, so you want to keep this as narrow as possible.

So why does my primary key have a clustered index on it?

For a number of reasons, Microsoft decided that when you create a primary key through the table designer in Management Studio (and in tools even before SQL Server 2005), it automatically makes that primary key a clustered index.

I’m not saying you should blame Microsoft, because in the vast majority of cases, this is perfectly acceptable. But it does make things bad for the wrong type of data type.

Let’s say that you use a randomly-generated value for a primary key. The usual example is a GUID, which is 16 bytes wide and random (compared to the narrow 4-byte integer). If you create a table in the table designer, and make this GUID column a primary key, SQL Server will automatically make that column a clustered index in the background.

For a clustered index, we need to proactively rebuild this clustered index when it gets beyond 1% (advice based on a presentation I saw by Jeff Moden), which means taking a more custom approach to maintenance plans. Keep in mind of course that your clustering key will now be 16 bytes wide (a BIGINT is half the size), and copied into every non-clustered index.

This is why more experienced SQL Server DBAs and developers tend to script out tables, adding the clustered index to the script manually, on a different column.

In other words, you can have a primary key without a clustered index (which is the same thing as having a unique non-clustered index on a table), and have the clustered index on a more appropriate column for storing the data.

If you have any questions or comments, please chat to me on Twitter at @bornsql .