This week we’re looking at how the database engine stores GUIDs (globally unique identifiers), specifically known as
UNIQUEIDENTIFIER in SQL Server.
If you would like to read about storage of other data types, here are the previous posts in the series:
What is a GUID?
According to Wikipedia, a universally unique identifier (UUID) — also known as a globally unique identifier (GUID) — is a 128-bit value (16 bytes) that is intended to be unique across multiple environments and timespans, and generated without needing a central repository. In theory (because it’s impossible to test), 128 bits is wide enough to ensure a unique value for any GUID as long as the standard is followed.
In this post we’ll see several different terms, but what we need to know is that “GUID” is Microsoft’s term for a UUID, and a GUID is called a
UNIQUEIDENTIFIER in SQL Server.
The concept of a unique identifier has been around for a while, but the most recent version of the standard is defined in RFC 4122, written by the Internet Engineering Task Force (IETF). RFC stands for “request for comment,” and is how most technical standards come about.
While it’s a dry technical read, it’s interesting to understand how the IETF thought about this problem. For instance, did you know that the algorithm was designed to generate ten million identifiers per system per second?
Additionally the algorithm also affects how the identifiers are sorted — which matters a lot in SQL Server — and we’ll cover that later in this post.
What does a GUID look like?
I asked SQL Server to make me a new GUID using the Transact-SQL command
SELECT NEWID(); and I got this back:
According to the RFC, a GUID comprises six segments; this is shown as 5 segments separated by dashes with two of them combined into one. These segments are called time-low, time-mid, time-high-and-version, clock-seq-and-reserved, clock-seq-low, and node. You can read in the RFC about what these mean and how they are generated.
There is no difference between uppercase and lowercase hexadecimal values, so
CC05E271-BACF-4472-901C-957568484405 is the same as
cc05e271-bacf-4472-901c-957568484405. An argument could be made that lowercase hexadecimal is easier to read.
If we take a closer look at our example GUID, we notice that the time-high-and-version segment starts with the number 4. This GUID is generated using the latest RFC standard (which is version 4), followed by the time-high value. In other words, if you want to know which version the GUID follows, check out that first position of the third segment.
It is reasonable to assume that GUIDs generated with older algorithms (indicated by a lower value in this position) will have a higher chance of collision.
If we ask SQL Server to generate ten GUIDs in a row as fast as we can, we might hope to notice a trend emerge. Thankfully this is not the case because the values generated by SQL Server using RFC 4122 are not meant to be predictable. Here is a random sampling generated in a single
SELECT statement. The only consistency is the version number, in the first position of the third segment.
How does SQL Server store GUIDs?
Let’s take our example GUID again:
CC05E271-BACF-4472-901C-957568484405. If we look at the table storage for this row, we’ll find it persisted as follows:
0x71E205CCCFBA7244901C957568484405 (alternating segments are highlighted in bold).
If you haven’t been following this series, this is a good place to remind you that SQL Server stores data using little-endian sequencing on disk and in memory. In the vast majority of cases, bytes are stored in reverse order because that’s how Intel CPUs like their data. However GUIDs are persisted slightly differently because of their sort order.
This is how it breaks down:
0x71E205CCis the first segment, byte-reversed.
0xCFBAis the second segment, byte-reversed.
0x7244is the third segment, byte-reversed. Note the version number in the second byte.
0x90is the fourth segment, which is not byte-reversed.
0x1Cis the fifth segment, not byte-reversed.
0x957568484405is the sixth segment, not byte-reversed.
You may ask yourself, well, how did we get here?
Let’s ask SQL Server to sort our list of ten GUIDs from earlier:
Now let’s pretend we’re looking at the data as it is persisted on disk, using the same order but exploding out the bytes to see them more easily:
04 DC 33 34 3E 15 91 49 B7 FF 05 6F 4A 8D 9D 6F
52 86 BE 5B CC E7 A4 43 96 2F 0A 62 F1 CB 83 0A
80 3B 96 B6 76 32 32 41 93 69 56 A0 BC 9A 60 E7
98 3B 1B E3 94 FC 01 4D B0 13 6C 36 E7 9B 38 EB
71 E2 05 CC CF BA 72 44 90 1C 95 75 68 48 44 05
5F FE 07 39 18 F6 04 4C A6 6C 9F CF AA 48 79 21
DC F0 99 7D EE 19 6B 4A A0 85 B4 75 6A 6C B8 16
AB B8 56 A7 A8 1E 15 42 B1 65 BA AF E9 90 20 D2
38 C2 A0 AD 77 58 96 42 88 0E BE D7 B1 F6 02 DF
7B 7C CC 27 46 BE 6F 48 B7 AF EA B6 9C 5E 66 30
In SQL Server, the
UNIQUEIDENTIFIER is ordered by the sixth segment, known as the node. If you read the RFC, this makes perfect sense. In version 1 of the UUID algorithm the node was based on a machine’s MAC address. All network interface controllers (Ethernet, wireless, Bluetooth, etc.) must have a unique media access control (MAC) identifier, which identifies them on a local network segment and allows normal networking operations to take place.
Over time, MAC address have become a way to identify people using those network devices, so standards have changed to allow MAC addresses to be changed, or pseudorandomized. Nevertheless Microsoft decided years ago to sort GUIDs based on the node segment. As this can be as random as computer algorithms allow, if we build an index on a column that itself is sorted by a randomly-generated value in the 11th byte of a 16-byte wide value, we are going to cause massive page splits in our index every time we insert a row.
Best practice for GUIDs in SQL Server
If you want to use GUIDs as unique identifiers, go ahead. They work very well for that task, and if you proactively rebuild this clustered index when it gets beyond 5% (advice based on a presentation I saw by Jeff Moden), you should be fine — bearing in mind of course that your clustering key will now be 16 bytes wide (a
BIGINT is half the size), and copied into every non-clustered index.
Use GUIDs, but use them with caution.
If the algorithm can generate 10 million values per second, maybe this is one of the reasons SQL Server’s
DATETIME2) data types have a granularity of 100 nanoseconds, because you can fit ten million units of 100ns into a single second. That’s something to ponder.
Share your thoughts in the comments below.