A quick(er) post this week, in response to Greg Low’s blog post from a few weeks ago titled “Don’t start identity columns or sequences with large negative values.”
Greg writes that you shouldn’t use large negative values in a table, because… they’re hard to read, I guess? And also they don’t compress well.
I disagree … to a degree. Dang, words are hard. Anyway, when I design a table I create what’s called a surrogate key as my primary key: a value whose only purpose is to uniquely identify a row so that it can participate efficiently in relational activities like joins and foreign keys. In other words, the identity column is not for me; it’s for the database engine. I don’t need to care which value belongs to which row. I choose the data type for that identity column based on the estimated number of rows, not on whether I can memorize that a [StatusID] of 5 means something. Magic numbers are bad, mmmkay?
If a table will never have more than 255 rows, I use a TINYINT and start the identity counter at zero. If it will have fewer than 32,767 rows, I use a SMALLINT, and if it will have fewer than 2.1 billion rows, I use an INT. If for some reason there may be more than 2.1 billion rows, I then reach for a BIGINT and hope row compression is turned on.
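By way of illustration, here’s a minimal sketch of that approach (the dbo.Status and dbo.Order tables and their columns are made up for this example): a small lookup table and a larger table that references it, each with a surrogate key sized to the estimated row count and seeded at zero.

```sql
-- A sketch with made-up names: the surrogate key exists only so the engine
-- can join and enforce foreign keys efficiently, sized to the estimated
-- row count and seeded at zero.
CREATE TABLE dbo.[Status]
(
    [StatusID]   TINYINT IDENTITY(0, 1) NOT NULL,  -- never more than a few hundred rows
    [StatusName] NVARCHAR(50) NOT NULL,
    CONSTRAINT [PK_Status] PRIMARY KEY CLUSTERED ([StatusID])
);

CREATE TABLE dbo.[Order]
(
    [OrderID]  INT IDENTITY(0, 1) NOT NULL,        -- fewer than 2.1 billion rows expected
    [StatusID] TINYINT NOT NULL,
    CONSTRAINT [PK_Order] PRIMARY KEY CLUSTERED ([OrderID]),
    CONSTRAINT [FK_Order_Status] FOREIGN KEY ([StatusID])
        REFERENCES dbo.[Status] ([StatusID])
);
```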
So far, so good. Greg and I agree on choosing the right data type for the purpose of the identity column, and that in most cases starting the counter at zero (and therefore “losing” all the possible values below zero) is naught much of a loss. Sorry, I’ll try harder next week.
There are cases where you may have far fewer than 2.1 billion rows in your table, but the amount of data churn (rows deleted and rows inserted) can cause the identity value to exceed the INT limit. In cases like this there are a few options. You could alter the column from INT to BIGINT, and on some versions of SQL Server (especially if you have data compression enabled), this could be a metadata-only operation that finishes instantly.
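The alter itself is a one-liner; here’s a sketch with a hypothetical dbo.BusyTable. Whether it finishes instantly or rewrites every row depends on the version and compression caveats above.

```sql
-- Widen the identity column in place. On some versions (especially with data
-- compression enabled) this can be metadata-only; otherwise it is a
-- size-of-data operation. Any primary key or index that includes the column
-- has to be dropped and recreated around the change.
ALTER TABLE dbo.BusyTable
    ALTER COLUMN BusyTableID BIGINT NOT NULL;
```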
Or you could reseed the identity column to start counting at negative 2.1 billion (the “large negative value” in Greg’s post). Guess what? You’ve just bought yourself a whole lot of time by doing that, and you didn’t need to change your data type.
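A sketch of the reseed, again with a hypothetical table name:

```sql
-- Restart the counter at the bottom of the INT range. On a table that already
-- contains rows, the next insert gets the reseed value plus the increment,
-- i.e. -2,147,483,647, and counts up from there.
DBCC CHECKIDENT (N'dbo.BusyTable', RESEED, -2147483648);
```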
What happens if you already have two billion rows in your table? Greg suggests that you should be using data compression anyway, and that switching to a BIGINT here is all that’s needed, as the compression algorithm will only use the bytes it needs. Unfortunately row compression isn’t available on non-Enterprise editions prior to SQL Server 2016 with Service Pack 1. If you’re on SQL Server 2012 or 2014, too bad for you.
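If row compression is available on your edition, enabling it looks something like this (hypothetical table name again); Greg’s point is that once the table is row-compressed, a BIGINT column holding small values only occupies the bytes it actually needs.

```sql
-- Rebuild the heap or clustered index with row compression so fixed-length
-- integer columns are stored using only the bytes they need.
-- Non-clustered indexes are compressed separately with ALTER INDEX ... REBUILD.
ALTER TABLE dbo.BusyTable
    REBUILD WITH (DATA_COMPRESSION = ROW);
```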
So your options are:
- you can alter the column to be double the size (at four extra bytes times 2.1 billion rows, that’s instantly another 8.5 GB you’ll need), which is a size-of-data operation requiring downtime unless you’re very smart; or
- you can reseed the identity column to a large negative value.
At least the second option allows you to bring your system online as fast as possible and get your manager off your back. Then you can use that extra time to come up with a plan.
Share your thoughts in the comments below.
Photo by Vishwarajsinh Rana on Unsplash.
This reminds me of a project from years ago, when I had to migrate a replicated MS-Access database to SQL Server. The replicated database used Random instead of Increment for the AutoNumber, so the IDs were all over the place: small and large, positive and negative. When we started the SQL version, the identity values started from the highest Access value, and based on our growth I estimated we would run out of int values in about two years. To resolve this, I located a large span of available negative IDs and reseeded the identity value to run from there; now the int values should last beyond the life of the app.
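In case it helps anyone in a similar spot, here’s a rough sketch of that idea (the dbo.LegacyTable table and ID column are made up): find the widest unused gap among the existing negative IDs, then reseed the identity into it.

```sql
-- Find the widest gap between existing negative IDs, then reseed just below it.
WITH NegativeIDs AS
(
    SELECT ID,
           LEAD(ID) OVER (ORDER BY ID) AS NextID
    FROM dbo.LegacyTable
    WHERE ID < 0
)
SELECT TOP (1)
       ID + 1 AS GapStart,
       NextID - 1 AS GapEnd,
       CAST(NextID AS BIGINT) - ID - 1 AS GapSize
FROM NegativeIDs
WHERE CAST(NextID AS BIGINT) - ID > 1
ORDER BY GapSize DESC;

-- Then reseed to the start of that gap, for example:
-- DBCC CHECKIDENT (N'dbo.LegacyTable', RESEED, -1900000000);
```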
Regarding the argument to not use large negative values for the identity because they are hard to read, the same argument could be used for positive numbers. The number 135583294 is really not any easier or harder to read than its negative value. It is hard to read due to the number of digits.
Formatting a large number with commas makes it much easier to read. 135,583,294 is much more legible than 135583294. If someone wanted to make working with large identity values easier, they could write an SSMS extension to put commas in large numbers within the result pane.
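For what it’s worth, the built-in FORMAT function (SQL Server 2012 and later) can already do this in an ad-hoc query, though it returns a string and the separators are culture-dependent:

```sql
-- Render a large identity value with thousands separators for readability.
SELECT FORMAT(135583294, N'N0') AS ReadableID;  -- returns the string 135,583,294
```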
The poor-compression argument is one whose rationale I do not fully follow. Compression, as I understand it, is the process of replacing identified patterns with a shorter representation that is put into a dictionary to be used for decompression. The lowest int value should compress very nicely: the value -2,147,483,648 in hex is 0x80000000. Also, -1 will compress nicely (hex: 0xFFFFFFFF). Integers close to 0 and integers close to the limits (-2,147,483,648 and 2,147,483,647, hex 0x7FFFFFFF) should all compress well.
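If you want to see those bit patterns for yourself, converting to BINARY(4) shows the four-byte two’s-complement pattern behind each value, for example:

```sql
-- The four-byte patterns behind the values mentioned above.
DECLARE @MinInt INT = -2147483648,
        @MaxInt INT = 2147483647;

SELECT CONVERT(BINARY(4), @MinInt) AS MinIntHex,   -- 0x80000000
       CONVERT(BINARY(4), -1)      AS MinusOneHex, -- 0xFFFFFFFF
       CONVERT(BINARY(4), @MaxInt) AS MaxIntHex;   -- 0x7FFFFFFF
```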
I have a post coming up next week showing how integer data types are encoded on disk, so this comment feeds nicely into that. Thanks for stopping by!