This week we’re looking at how the database engine stores GUIDs (globally unique identifiers), specifically known as UNIQUEIDENTIFIER
in SQL Server.
If you would like to read about storage of other data types, here are the previous posts in the series:
What is a GUID?
According to Wikipedia, a universally unique identifier (UUID) — also known as a globally unique identifier (GUID) — is a 128-bit value (16 bytes) that is intended to be unique across multiple environments and timespans, and generated without needing a central repository. In theory (because it’s impossible to test), 128 bits is wide enough to ensure a unique value for any GUID as long as the standard is followed.
In this post we’ll see several different terms, but what we need to know is that “GUID” is Microsoft’s term for a UUID, and a GUID is called a UNIQUEIDENTIFIER
in SQL Server.
The concept of a unique identifier has been around for a while, but the most recent version of the standard is defined in RFC 4122, written by the Internet Engineering Task Force (IETF). RFC stands for “request for comment,” and is how most technical standards come about.
While it’s a dry technical read, it’s interesting to understand how the IETF thought about this problem. For instance, did you know that the algorithm was designed to generate ten million identifiers per system per second?
Additionally the algorithm also affects how the identifiers are sorted — which matters a lot in SQL Server — and we’ll cover that later in this post.
What does a GUID look like?
I asked SQL Server to make me a new GUID using the Transact-SQL command SELECT NEWID();
and I got this back:
CC05E271-BACF-4472-901C-957568484405
According to the RFC, a GUID comprises six segments; this is shown as 5 segments separated by dashes with two of them combined into one. These segments are called time-low, time-mid, time-high-and-version, clock-seq-and-reserved, clock-seq-low, and node. You can read in the RFC about what these mean and how they are generated.
Segment | Value | Size |
time-low | CC05E271 |
4-hex |
time-mid | BACF |
2-hex |
time-high-and-version | 4472 |
2-hex |
clock-seq-and-reserved | 90 |
1-hex |
clock-seq-low | 1C |
1-hex |
node | 957568484405 |
6-hex |
There is no difference between uppercase and lowercase hexadecimal values, so CC05E271-BACF-4472-901C-957568484405
is the same as cc05e271-bacf-4472-901c-957568484405
. An argument could be made that lowercase hexadecimal is easier to read.
GUID version
If we take a closer look at our example GUID, we notice that the time-high-and-version segment starts with the number 4. This GUID is generated using the latest RFC standard (which is version 4), followed by the time-high value. In other words, if you want to know which version the GUID follows, check out that first position of the third segment.
It is reasonable to assume that GUIDs generated with older algorithms (indicated by a lower value in this position) will have a higher chance of collision.
If we ask SQL Server to generate ten GUIDs in a row as fast as we can, we might hope to notice a trend emerge. Thankfully this is not the case because the values generated by SQL Server using RFC 4122 are not meant to be predictable. Here is a random sampling generated in a single SELECT
statement. The only consistency is the version number, in the first position of the third segment.
CC05E271-BACF-4472-901C-957568484405
7D99F0DC-19EE-4A6B-A085-B4756A6CB816
3433DC04-153E-4991-B7FF-056F4A8D9D6F
E31B3B98-FC94-4D01-B013-6C36E79B38EB
B6963B80-3276-4132-9369-56A0BC9A60E7
3907FE5F-F618-4C04-A66C-9FCFAA487921
ADA0C238-5877-4296-880E-BED7B1F602DF
A756B8AB-1EA8-4215-B165-BAAFE99020D2
27CC7C7B-BE46-486F-B7AF-EAB69C5E6630
5BEF8652-E7CC-43A4-962F-0A62F1CB830A
How does SQL Server store GUIDs?
Let’s take our example GUID again: CC05E271-BACF-4472-901C-957568484405
. If we look at the table storage for this row, we’ll find it persisted as follows: 0x71E205CCCFBA7244901C957568484405
(alternating segments are highlighted in bold).
If you haven’t been following this series, this is a good place to remind you that SQL Server stores data using little-endian sequencing on disk and in memory. In the vast majority of cases, bytes are stored in reverse order because that’s how Intel CPUs like their data. However GUIDs are persisted slightly differently because of their sort order.
This is how it breaks down:
0x71E205CC
is the first segment, byte-reversed.0xCFBA
is the second segment, byte-reversed.0x7244
is the third segment, byte-reversed. Note the version number in the second byte.0x90
is the fourth segment, which is not byte-reversed.0x1C
is the fifth segment, not byte-reversed.0x957568484405
is the sixth segment, not byte-reversed.
You may ask yourself, well, how did we get here?
Sorting GUIDs
Let’s ask SQL Server to sort our list of ten GUIDs from earlier:
3433DC04-153E-4991-B7FF-056F4A8D9D6F
5BEF8652-E7CC-43A4-962F-0A62F1CB830A
B6963B80-3276-4132-9369-56A0BC9A60E7
E31B3B98-FC94-4D01-B013-6C36E79B38EB
CC05E271-BACF-4472-901C-957568484405
3907FE5F-F618-4C04-A66C-9FCFAA487921
7D99F0DC-19EE-4A6B-A085-B4756A6CB816
A756B8AB-1EA8-4215-B165-BAAFE99020D2
ADA0C238-5877-4296-880E-BED7B1F602DF
27CC7C7B-BE46-486F-B7AF-EAB69C5E6630
Now let’s pretend we’re looking at the data as it is persisted on disk, using the same order but exploding out the bytes to see them more easily:
04 DC 33 34 3E 15 91 49 B7 FF 05 6F 4A 8D 9D 6F
52 86 BE 5B CC E7 A4 43 96 2F 0A 62 F1 CB 83 0A
80 3B 96 B6 76 32 32 41 93 69 56 A0 BC 9A 60 E7
98 3B 1B E3 94 FC 01 4D B0 13 6C 36 E7 9B 38 EB
71 E2 05 CC CF BA 72 44 90 1C 95 75 68 48 44 05
5F FE 07 39 18 F6 04 4C A6 6C 9F CF AA 48 79 21
DC F0 99 7D EE 19 6B 4A A0 85 B4 75 6A 6C B8 16
AB B8 56 A7 A8 1E 15 42 B1 65 BA AF E9 90 20 D2
38 C2 A0 AD 77 58 96 42 88 0E BE D7 B1 F6 02 DF
7B 7C CC 27 46 BE 6F 48 B7 AF EA B6 9C 5E 66 30
In SQL Server, the UNIQUEIDENTIFIER
is ordered by the sixth segment, known as the node. If you read the RFC, this makes perfect sense. In version 1 of the UUID algorithm the node was based on a machine’s MAC address. All network interface controllers (Ethernet, wireless, Bluetooth, etc.) must have a unique media access control (MAC) identifier, which identifies them on a local network segment and allows normal networking operations to take place.
Over time, MAC addresses have become a way to identify people using those network devices, so standards have changed to allow MAC addresses to be changed, or pseudorandomized. Nevertheless Microsoft decided years ago to sort GUIDs based on the node segment. As this can be as random as computer algorithms allow, if we build an index on a column that itself is sorted by a randomly-generated value in the 11th byte of a 16-byte wide value, we are going to cause massive page splits in our index every time we insert a row.
Best practice for GUIDs in SQL Server
If you want to use GUIDs as unique identifiers, go ahead. They work very well for that task, and if you proactively rebuild this clustered index when it gets beyond 1% (advice based on a presentation I saw by Jeff Moden), you should be fine — bearing in mind of course that your clustering key will now be 16 bytes wide (a BIGINT
is half the size), and copied into every non-clustered index.
Use GUIDs, but use them with caution.
Final thought
If the algorithm can generate 10 million values per second, maybe this is one of the reasons SQL Server’s TIME
(and DATETIME2
) data types have a granularity of 100 nanoseconds, because you can fit ten million units of 100ns into a single second. That’s something to ponder.
Share your thoughts in the comments below.
Photo by Ashim D’Silva on Unsplash.
“You may ask yourself, well, how did we get here?” — Nice.
How could I resist? 😌
This is an interesting article. I do feel compelled to note that everywhere else I have seen the word “octet”, it refers to _exactly 8 bits_, and in this article, different GUID _fields_ (that is the word for them used by the RFC) are referred to as “octets” even when they comprise _multiple_ octets.
I called them segments when I first wrote the piece but deferred to the RFC.
Ahh, I see — if you double-check the RFC, I think you’ll find that its use of “octet” is basically the same as other people would use the word “byte”, and that it states that the GUID is, overall, 16 octets (bytes) long but is composed of 6 fields (listed in section 4.1.2 with column header “Field”).
Interesting, thanks!
Very interesting! I’d honestly been wondering in a corner of my mind how guids and indexing should play together, and how sorting of them is done. In a scenario using Rank() in SQL, I’ve sometimes had to use a UniqueIdentifier column as a tie breaker between otherwise completely identical rows. Thanks!
Awesome article, Randolph. And a very appropriate graphic for GUIDs, indeed.
Just as a bit of sometimes useful trivia, there IS one other consistency in GUIDs whether they’re the old type 1, the newer type 4, and even the very new NEWSEQUENTIALID() and that is the first character of the 4th “segment” of the GUID. That first character is, of course, 4 bytes wide (a “nibble”).
That nibble is sometimes refered to as the “N” nibble. The first 1, 2, or 3 bits of that nibble are for the sub-version of the GUID type being used.. For all of the built in GUIDs for SQL Server (including the old type 1 GUIDs), the most significant two bits of that nibble are “1” and “0” making the nibble take on the pattern of “10nn” where “n” can only, of course, have a value of “1” or “0”.
That means that the “consistency” of the “N” nibble is that it will only ever contain the values of 8, 9, A, or B.
And I actually recommend that a rebuild occur at >1% logical fragmentation. The 5% thing came from the demo where we demonstrated the craziness that occurs with “Best Practice” index maintenance, which actually isn’t a best practice if, for no other reason, it uses REORGANIZE, which is correctly documented but works nothing like people perceive it to work.
BTW, I REALLY enjoyed your session on EightKB on the subject of data types.
liked the sorting part very much, wasn’t aware about this.
Thanks
Comments are closed.