How SQL Server stores data types: UNIQUEIDENTIFIER

This week we’re looking at how the database engine stores GUIDs (globally unique identifiers), specifically known as UNIQUEIDENTIFIER in SQL Server.

If you would like to read about storage of other data types, here are the previous posts in the series:

What is a GUID?

According to Wikipedia, a universally unique identifier (UUID) — also known as a globally unique identifier (GUID) — is a 128-bit value (16 bytes) that is intended to be unique across multiple environments and timespans, and generated without needing a central repository. In theory (because it’s impossible to test), 128 bits is wide enough to ensure a unique value for any GUID as long as the standard is followed.

In this post we’ll see several different terms, but what we need to know is that “GUID” is Microsoft’s term for a UUID, and a GUID is called a UNIQUEIDENTIFIER in SQL Server.

The concept of a unique identifier has been around for a while, but the most recent version of the standard is defined in RFC 4122, written by the Internet Engineering Task Force (IETF). RFC stands for “request for comment,” and is how most technical standards come about.

While it’s a dry technical read, it’s interesting to understand how the IETF thought about this problem. For instance, did you know that the algorithm was designed to generate ten million identifiers per system per second?

Additionally the algorithm also affects how the identifiers are sorted — which matters a lot in SQL Server — and we’ll cover that later in this post.

What does a GUID look like?

I asked SQL Server to make me a new GUID using the Transact-SQL command SELECT NEWID(); and I got this back:

CC05E271-BACF-4472-901C-957568484405

According to the RFC, a GUID comprises six segments; this is shown as 5 segments separated by dashes with two of them combined into one. These segments are called time-low, time-mid, time-high-and-version, clock-seq-and-reserved, clock-seq-low, and node. You can read in the RFC about what these mean and how they are generated.

Segment	Value	Size
time-low	`CC05E271`	4-hex
time-mid	`BACF`	2-hex
time-high-and-version	`4472`	2-hex
clock-seq-and-reserved	`90`	1-hex
clock-seq-low	`1C`	1-hex
node	`957568484405`	6-hex

There is no difference between uppercase and lowercase hexadecimal values, so CC05E271-BACF-4472-901C-957568484405 is the same as cc05e271-bacf-4472-901c-957568484405. An argument could be made that lowercase hexadecimal is easier to read.

GUID version

If we take a closer look at our example GUID, we notice that the time-high-and-version segment starts with the number 4. This GUID is generated using the latest RFC standard (which is version 4), followed by the time-high value. In other words, if you want to know which version the GUID follows, check out that first position of the third segment.

It is reasonable to assume that GUIDs generated with older algorithms (indicated by a lower value in this position) will have a higher chance of collision.

If we ask SQL Server to generate ten GUIDs in a row as fast as we can, we might hope to notice a trend emerge. Thankfully this is not the case because the values generated by SQL Server using RFC 4122 are not meant to be predictable. Here is a random sampling generated in a single SELECT statement. The only consistency is the version number, in the first position of the third segment.

CC05E271-BACF-4472-901C-957568484405 7D99F0DC-19EE-4A6B-A085-B4756A6CB816 3433DC04-153E-4991-B7FF-056F4A8D9D6F E31B3B98-FC94-4D01-B013-6C36E79B38EB B6963B80-3276-4132-9369-56A0BC9A60E7 3907FE5F-F618-4C04-A66C-9FCFAA487921 ADA0C238-5877-4296-880E-BED7B1F602DF A756B8AB-1EA8-4215-B165-BAAFE99020D2 27CC7C7B-BE46-486F-B7AF-EAB69C5E6630 5BEF8652-E7CC-43A4-962F-0A62F1CB830A

How does SQL Server store GUIDs?

Let’s take our example GUID again: CC05E271-BACF-4472-901C-957568484405. If we look at the table storage for this row, we’ll find it persisted as follows: 0x71E205CCCFBA7244901C957568484405 (alternating segments are highlighted in bold).

If you haven’t been following this series, this is a good place to remind you that SQL Server stores data using little-endian sequencing on disk and in memory. In the vast majority of cases, bytes are stored in reverse order because that’s how Intel CPUs like their data. However GUIDs are persisted slightly differently because of their sort order.

This is how it breaks down:

0x71E205CC is the first segment, byte-reversed.
0xCFBA is the second segment, byte-reversed.
0x7244 is the third segment, byte-reversed. Note the version number in the second byte.
0x90 is the fourth segment, which is not byte-reversed.
0x1C is the fifth segment, not byte-reversed.
0x957568484405 is the sixth segment, not byte-reversed.

You may ask yourself, well, how did we get here?

Sorting GUIDs

Let’s ask SQL Server to sort our list of ten GUIDs from earlier:

3433DC04-153E-4991-B7FF-056F4A8D9D6F 5BEF8652-E7CC-43A4-962F-0A62F1CB830A B6963B80-3276-4132-9369-56A0BC9A60E7 E31B3B98-FC94-4D01-B013-6C36E79B38EB CC05E271-BACF-4472-901C-957568484405 3907FE5F-F618-4C04-A66C-9FCFAA487921 7D99F0DC-19EE-4A6B-A085-B4756A6CB816 A756B8AB-1EA8-4215-B165-BAAFE99020D2 ADA0C238-5877-4296-880E-BED7B1F602DF 27CC7C7B-BE46-486F-B7AF-EAB69C5E6630

Now let’s pretend we’re looking at the data as it is persisted on disk, using the same order but exploding out the bytes to see them more easily:

04 DC 33 34 3E 15 91 49 B7 FF 05 6F 4A 8D 9D 6F 52 86 BE 5B CC E7 A4 43 96 2F 0A 62 F1 CB 83 0A 80 3B 96 B6 76 32 32 41 93 69 56 A0 BC 9A 60 E7 98 3B 1B E3 94 FC 01 4D B0 13 6C 36 E7 9B 38 EB 71 E2 05 CC CF BA 72 44 90 1C 95 75 68 48 44 05 5F FE 07 39 18 F6 04 4C A6 6C 9F CF AA 48 79 21 DC F0 99 7D EE 19 6B 4A A0 85 B4 75 6A 6C B8 16 AB B8 56 A7 A8 1E 15 42 B1 65 BA AF E9 90 20 D2 38 C2 A0 AD 77 58 96 42 88 0E BE D7 B1 F6 02 DF 7B 7C CC 27 46 BE 6F 48 B7 AF EA B6 9C 5E 66 30

In SQL Server, the UNIQUEIDENTIFIER is ordered by the sixth segment, known as the node. If you read the RFC, this makes perfect sense. In version 1 of the UUID algorithm the node was based on a machine’s MAC address. All network interface controllers (Ethernet, wireless, Bluetooth, etc.) must have a unique media access control (MAC) identifier, which identifies them on a local network segment and allows normal networking operations to take place.

Over time, MAC addresses have become a way to identify people using those network devices, so standards have changed to allow MAC addresses to be changed, or pseudorandomized. Nevertheless Microsoft decided years ago to sort GUIDs based on the node segment. As this can be as random as computer algorithms allow, if we build an index on a column that itself is sorted by a randomly-generated value in the 11th byte of a 16-byte wide value, we are going to cause massive page splits in our index every time we insert a row.

Best practice for GUIDs in SQL Server

If you want to use GUIDs as unique identifiers, go ahead. They work very well for that task, and if you proactively rebuild this clustered index when it gets beyond 1% (advice based on a presentation I saw by Jeff Moden), you should be fine — bearing in mind of course that your clustering key will now be 16 bytes wide (a BIGINT is half the size), and copied into every non-clustered index.

Use GUIDs, but use them with caution.

Final thought

If the algorithm can generate 10 million values per second, maybe this is one of the reasons SQL Server’s TIME (and DATETIME2) data types have a granularity of 100 nanoseconds, because you can fit ten million units of 100ns into a single second. That’s something to ponder.

Share your thoughts in the comments below.

Photo by Ashim D’Silva on Unsplash.

9 thoughts on “How SQL Server stores data types: UNIQUEIDENTIFIER”

Tom Paine June 17, 2020 at 10:47 am

“You may ask yourself, well, how did we get here?” — Nice.
1. randolph June 17, 2020 at 10:48 am
  
  How could I resist? 😌
logiclrd June 29, 2020 at 3:35 pm

This is an interesting article. I do feel compelled to note that everywhere else I have seen the word “octet”, it refers to _exactly 8 bits_, and in this article, different GUID _fields_ (that is the word for them used by the RFC) are referred to as “octets” even when they comprise _multiple_ octets.
1. randolph June 29, 2020 at 3:41 pm
  
  I called them segments when I first wrote the piece but deferred to the RFC.
logiclrd June 29, 2020 at 4:58 pm

Ahh, I see — if you double-check the RFC, I think you’ll find that its use of “octet” is basically the same as other people would use the word “byte”, and that it states that the GUID is, overall, 16 octets (bytes) long but is composed of 6 fields (listed in section 4.1.2 with column header “Field”).
1. randolph June 29, 2020 at 4:58 pm
  
  Interesting, thanks!
Curtis Herrick March 21, 2021 at 6:17 am

Very interesting! I’d honestly been wondering in a corner of my mind how guids and indexing should play together, and how sorting of them is done. In a scenario using Rank() in SQL, I’ve sometimes had to use a UniqueIdentifier column as a tie breaker between otherwise completely identical rows. Thanks!
Jeff Moden August 15, 2021 at 10:12 pm

Awesome article, Randolph. And a very appropriate graphic for GUIDs, indeed.

Just as a bit of sometimes useful trivia, there IS one other consistency in GUIDs whether they’re the old type 1, the newer type 4, and even the very new NEWSEQUENTIALID() and that is the first character of the 4th “segment” of the GUID. That first character is, of course, 4 bytes wide (a “nibble”).

That nibble is sometimes refered to as the “N” nibble. The first 1, 2, or 3 bits of that nibble are for the sub-version of the GUID type being used.. For all of the built in GUIDs for SQL Server (including the old type 1 GUIDs), the most significant two bits of that nibble are “1” and “0” making the nibble take on the pattern of “10nn” where “n” can only, of course, have a value of “1” or “0”.

That means that the “consistency” of the “N” nibble is that it will only ever contain the values of 8, 9, A, or B.

And I actually recommend that a rebuild occur at >1% logical fragmentation. The 5% thing came from the demo where we demonstrated the craziness that occurs with “Best Practice” index maintenance, which actually isn’t a best practice if, for no other reason, it uses REORGANIZE, which is correctly documented but works nothing like people perceive it to work.

BTW, I REALLY enjoyed your session on EightKB on the subject of data types.
Vikas August 20, 2021 at 10:28 am

liked the sorting part very much, wasn’t aware about this.
Thanks

Comments are closed.