This week we’re looking at how the database engine stores GUIDs (globally unique identifiers), specifically known as
UNIQUEIDENTIFIER in SQL Server.
If you would like to read about storage of other data types, here are the previous posts in the series:
What is a GUID?
According to Wikipedia, a universally unique identifier (UUID) — also known as a globally unique identifier (GUID) — is a 128-bit value (16 bytes) that is intended to be unique across multiple environments and timespans, and generated without needing a central repository. In theory (because it’s impossible to test), 128 bits is wide enough to ensure a unique value for any GUID as long as the standard is followed.
In this post we’ll see several different terms, but what we need to know is that “GUID” is Microsoft’s term for a UUID, and a GUID is called a
UNIQUEIDENTIFIER in SQL Server.
The concept of a unique identifier has been around for a while, but the most recent version of the standard is defined in RFC 4122, written by the Internet Engineering Task Force (IETF). RFC stands for “request for comment,” and is how most technical standards come about.
While it’s a dry technical read, it’s interesting to understand how the IETF thought about this problem. For instance, did you know that the algorithm was designed to generate ten million identifiers per system per second?
Additionally the algorithm also affects how the identifiers are sorted — which matters a lot in SQL Server — and we’ll cover that later in this post.
What does a GUID look like?
I asked SQL Server to make me a new GUID using the Transact-SQL command
SELECT NEWID(); and I got this back:
According to the RFC, a GUID comprises six octets; this is shown as 5 segments separated by dashes with two of the octets combined in one of the segments. These octets are called time-low, time-mid, time-high-and-version, clock-seq-and-reserved, clock-seq-low, and node. You can read in the RFC about what these mean and how they are generated.
There is no difference between uppercase and lowercase hexadecimal values, so
CC05E271-BACF-4472-901C-957568484405 is the same as
cc05e271-bacf-4472-901c-957568484405. An argument could be made that lowercase hexadecimal is easier to read.
If we take a closer look at our example GUID, we notice that the time-high-and-version octet starts with the number 4. This GUID is generated using the latest RFC standard (which is version 4), followed by the time-high value. In other words, if you want to know which version the GUID follows, check out that first position of the third octet.
It is reasonable to assume that GUIDs generated with older algorithms (indicated by a lower value in this position) will have a higher chance of collision.
If we ask SQL Server to generate ten GUIDs in a row as fast as we can, we might hope to notice a trend emerge. Thankfully this is not the case because the values generated by SQL Server using RFC 4122 are not meant to be predictable. Here is a random sampling generated in a single
SELECT statement. The only consistency is the version number, in the first position of the third octet.
How does SQL Server store GUIDs?
Let’s take our example GUID again:
CC05E271-BACF-4472-901C-957568484405. If we look at the table storage for this row, we’ll find it persisted as follows:
0x71E205CCCFBA7244901C957568484405 (alternating octets are highlighted in bold).
If you haven’t been following this series, this is a good place to remind you that SQL Server stores data using little-endian sequencing on disk and in memory. In the vast majority of cases, bytes are stored in reverse order because that’s how Intel CPUs like their data. However GUIDs are persisted slightly differently because of their sort order.
This is how it breaks down:
0x71E205CCis the first octet, byte-reversed.
0xCFBAis the second octet, byte-reversed.
0x7244is the third octet, byte-reversed. Note the version number in the second byte.
0x90is the fourth octet, which is not byte-reversed.
0x1Cis the fifth octet, not byte-reversed.
0x957568484405is the sixth octet, not byte-reversed.
You may ask yourself, well, how did we get here?
Let’s ask SQL Server to sort our list of ten GUIDs from earlier:
Now let’s pretend we’re looking at the data as it is persisted on disk, using the same order but exploding out the bytes to see them more easily:
04 DC 33 34 3E 15 91 49 B7 FF 05 6F 4A 8D 9D 6F
52 86 BE 5B CC E7 A4 43 96 2F 0A 62 F1 CB 83 0A
80 3B 96 B6 76 32 32 41 93 69 56 A0 BC 9A 60 E7
98 3B 1B E3 94 FC 01 4D B0 13 6C 36 E7 9B 38 EB
71 E2 05 CC CF BA 72 44 90 1C 95 75 68 48 44 05
5F FE 07 39 18 F6 04 4C A6 6C 9F CF AA 48 79 21
DC F0 99 7D EE 19 6B 4A A0 85 B4 75 6A 6C B8 16
AB B8 56 A7 A8 1E 15 42 B1 65 BA AF E9 90 20 D2
38 C2 A0 AD 77 58 96 42 88 0E BE D7 B1 F6 02 DF
7B 7C CC 27 46 BE 6F 48 B7 AF EA B6 9C 5E 66 30
In SQL Server, the
UNIQUEIDENTIFIER is ordered by the sixth octet, known as the node. If you read the RFC, this makes perfect sense. In version 1 of the UUID algorithm the node was based on a machine’s MAC address. All network interface controllers (Ethernet, wireless, Bluetooth, etc.) must have a unique media access control (MAC) identifier, which identifies them on a local network segment and allows normal networking operations to take place.
Over time, MAC address have become a way to identify people using those network devices, so standards have changed to allow MAC addresses to be changed, or pseudorandomized. Nevertheless Microsoft decided years ago to sort GUIDs based on the node octet. As this can be as random as computer algorithms allow, if we build a clustered index on a column that itself is sorted by a randomly-generated value in the 11th byte of a 16-byte wide value, we are going to cause massive page splits in our table every time we insert a row.
Best practice for GUIDs in SQL Server
If you want to use GUIDs as unique identifiers, go ahead. They work very well for that task, however don’t use them as a clustered index. By all means place a non-clustered index on a GUID, but don’t use it as the clustering key as it will hurt performance. It will make your clustered index larger than it needs to be (a
BIGINT is half the size), and it will inflate any non-clustered indexes by an additional 16 bytes per index per row because the clustering key must be included in all non-clustered indexes.
Use GUIDs, but use them with caution.
If the algorithm can generate 10 million values per second, maybe this is one of the reasons SQL Server’s
DATETIME2) data types have a granularity of 100 nanoseconds, because you can fit ten million units of 100ns into a single second. That’s something to ponder.
Share your thoughts in the comments below.