How SQL Server stores data types: UNIQUEIDENTIFIER

A random selection of tools

This week we’re looking at how the database engine stores GUIDs (globally unique identifiers), specifically known as UNIQUEIDENTIFIER in SQL Server.

If you would like to read about storage of other data types, here are the previous posts in the series:

What is a GUID?

According to Wikipedia, a universally unique identifier (UUID) — also known as a globally unique identifier (GUID) — is a 128-bit value (16 bytes) that is intended to be unique across multiple environments and timespans, and generated without needing a central repository. In theory (because it’s impossible to test), 128 bits is wide enough to ensure a unique value for any GUID as long as the standard is followed.

In this post we’ll see several different terms, but what we need to know is that “GUID” is Microsoft’s term for a UUID, and a GUID is called a UNIQUEIDENTIFIER in SQL Server.

The concept of a unique identifier has been around for a while, but the most recent version of the standard is defined in RFC 4122, written by the Internet Engineering Task Force (IETF). RFC stands for “request for comment,” and is how most technical standards come about.

While it’s a dry technical read, it’s interesting to understand how the IETF thought about this problem. For instance, did you know that the algorithm was designed to generate ten million identifiers per system per second?

Additionally the algorithm also affects how the identifiers are sorted — which matters a lot in SQL Server — and we’ll cover that later in this post.

What does a GUID look like?

I asked SQL Server to make me a new GUID using the Transact-SQL command SELECT NEWID(); and I got this back:

CC05E271-BACF-4472-901C-957568484405

According to the RFC, a GUID comprises six octets; this is shown as 5 segments separated by dashes with two of the octets combined in one of the segments. These octets are called time-low, time-mid, time-high-and-version, clock-seq-and-reserved, clock-seq-low, and node. You can read in the RFC about what these mean and how they are generated.

Octet Value Size
time-low CC05E271 4-hex
time-mid BACF 2-hex
time-high-and-version 4472 2-hex
clock-seq-and-reserved 90 1-hex
clock-seq-low 1C 1-hex
node 957568484405 6-hex

There is no difference between uppercase and lowercase hexadecimal values, so CC05E271-BACF-4472-901C-957568484405 is the same as cc05e271-bacf-4472-901c-957568484405. An argument could be made that lowercase hexadecimal is easier to read.

GUID version

If we take a closer look at our example GUID, we notice that the time-high-and-version octet starts with the number 4. This GUID is generated using the latest RFC standard (which is version 4), followed by the time-high value. In other words, if you want to know which version the GUID follows, check out that first position of the third octet.

It is reasonable to assume that GUIDs generated with older algorithms (indicated by a lower value in this position) will have a higher chance of collision.

If we ask SQL Server to generate ten GUIDs in a row as fast as we can, we might hope to notice a trend emerge. Thankfully this is not the case because the values generated by SQL Server using RFC 4122 are not meant to be predictable. Here is a random sampling generated in a single SELECT statement. The only consistency is the version number, in the first position of the third octet.

CC05E271-BACF-4472-901C-957568484405
7D99F0DC-19EE-4A6B-A085-B4756A6CB816
3433DC04-153E-4991-B7FF-056F4A8D9D6F
E31B3B98-FC94-4D01-B013-6C36E79B38EB
B6963B80-3276-4132-9369-56A0BC9A60E7
3907FE5F-F618-4C04-A66C-9FCFAA487921
ADA0C238-5877-4296-880E-BED7B1F602DF
A756B8AB-1EA8-4215-B165-BAAFE99020D2
27CC7C7B-BE46-486F-B7AF-EAB69C5E6630
5BEF8652-E7CC-43A4-962F-0A62F1CB830A

How does SQL Server store GUIDs?

Let’s take our example GUID again: CC05E271-BACF-4472-901C-957568484405. If we look at the table storage for this row, we’ll find it persisted as follows: 0x71E205CCCFBA7244901C957568484405 (alternating octets are highlighted in bold).

If you haven’t been following this series, this is a good place to remind you that SQL Server stores data using little-endian sequencing on disk and in memory. In the vast majority of cases, bytes are stored in reverse order because that’s how Intel CPUs like their data. However GUIDs are persisted slightly differently because of their sort order.

This is how it breaks down:

  • 0x71E205CC is the first octet, byte-reversed.
  • 0xCFBA is the second octet, byte-reversed.
  • 0x7244 is the third octet, byte-reversed. Note the version number in the second byte.
  • 0x90 is the fourth octet, which is not byte-reversed.
  • 0x1C is the fifth octet, not byte-reversed.
  • 0x957568484405 is the sixth octet, not byte-reversed.

You may ask yourself, well, how did we get here?

Sorting GUIDs

Let’s ask SQL Server to sort our list of ten GUIDs from earlier:

3433DC04-153E-4991-B7FF-056F4A8D9D6F
5BEF8652-E7CC-43A4-962F-0A62F1CB830A
B6963B80-3276-4132-9369-56A0BC9A60E7
E31B3B98-FC94-4D01-B013-6C36E79B38EB
CC05E271-BACF-4472-901C-957568484405
3907FE5F-F618-4C04-A66C-9FCFAA487921
7D99F0DC-19EE-4A6B-A085-B4756A6CB816
A756B8AB-1EA8-4215-B165-BAAFE99020D2
ADA0C238-5877-4296-880E-BED7B1F602DF
27CC7C7B-BE46-486F-B7AF-EAB69C5E6630

Now let’s pretend we’re looking at the data as it is persisted on disk, using the same order but exploding out the bytes to see them more easily:

04 DC 33 34 3E 15 91 49 B7 FF 05 6F 4A 8D 9D 6F
52 86 BE 5B CC E7 A4 43 96 2F 0A 62 F1 CB 83 0A
80 3B 96 B6 76 32 32 41 93 69 56 A0 BC 9A 60 E7
98 3B 1B E3 94 FC 01 4D B0 13 6C 36 E7 9B 38 EB
71 E2 05 CC CF BA 72 44 90 1C 95 75 68 48 44 05
5F FE 07 39 18 F6 04 4C A6 6C 9F CF AA 48 79 21
DC F0 99 7D EE 19 6B 4A A0 85 B4 75 6A 6C B8 16
AB B8 56 A7 A8 1E 15 42 B1 65 BA AF E9 90 20 D2
38 C2 A0 AD 77 58 96 42 88 0E BE D7 B1 F6 02 DF
7B 7C CC 27 46 BE 6F 48 B7 AF EA B6 9C 5E 66 30

In SQL Server, the UNIQUEIDENTIFIER is ordered by the sixth octet, known as the node. If you read the RFC, this makes perfect sense. In version 1 of the UUID algorithm the node was based on a machine’s MAC address. All network interface controllers (Ethernet, wireless, Bluetooth, etc.) must have a unique media access control (MAC) identifier, which identifies them on a local network segment and allows normal networking operations to take place.

Over time, MAC address have become a way to identify people using those network devices, so standards have changed to allow MAC addresses to be changed, or pseudorandomized. Nevertheless Microsoft decided years ago to sort GUIDs based on the node octet. As this can be as random as computer algorithms allow, if we build a clustered index on a column that itself is sorted by a randomly-generated value in the 11th byte of a 16-byte wide value, we are going to cause massive page splits in our table every time we insert a row.

Best practice for GUIDs in SQL Server

If you want to use GUIDs as unique identifiers, go ahead. They work very well for that task, however don’t use them as a clustered index. By all means place a non-clustered index on a GUID, but don’t use it as the clustering key as it will hurt performance. It will make your clustered index larger than it needs to be (a BIGINT is half the size), and it will inflate any non-clustered indexes by an additional 16 bytes per index per row because the clustering key must be included in all non-clustered indexes.

Use GUIDs, but use them with caution.

Final thought

If the algorithm can generate 10 million values per second, maybe this is one of the reasons SQL Server’s TIME (and DATETIME2) data types have a granularity of 100 nanoseconds, because you can fit ten million units of 100ns into a single second. That’s something to ponder.

Share your thoughts in the comments below.

Photo by Ashim D’Silva on Unsplash.

6 thoughts on “How SQL Server stores data types: UNIQUEIDENTIFIER

  • This is an interesting article. I do feel compelled to note that everywhere else I have seen the word “octet”, it refers to _exactly 8 bits_, and in this article, different GUID _fields_ (that is the word for them used by the RFC) are referred to as “octets” even when they comprise _multiple_ octets.

  • Ahh, I see — if you double-check the RFC, I think you’ll find that its use of “octet” is basically the same as other people would use the word “byte”, and that it states that the GUID is, overall, 16 octets (bytes) long but is composed of 6 fields (listed in section 4.1.2 with column header “Field”).

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: