Fundamentals of Data Types

Last week, we discussed storing text in a database. This week we will dive deeper into data types.

When storing data in our database, we want to make sure that it’s stored accurately and that we only use the required amount of space.

This is because when we access the data later, we want to make sure any calculations are accurate; plus reading the data takes up memory, and we want to be as efficient as we can with memory usage.

There are seven data type categories in SQL Server:

  • exact numerics
  • approximate numerics
  • date and time
  • character strings
  • Unicode character strings
  • binary strings
  • other

When we want to use these data types for our columns, we need to declare them. Some require a length, some require a precision and scale, and some can be declared without a length at all. For example:

No Length (implied in data type):

Explicit Length (length is supplied):
DECLARE @firstName AS VARCHAR(255);

Precision and Scale:
DECLARE @interestRate AS DECIMAL(9,3);

Let’s talk a bit about precision and scale, because those values between the brackets may not work the way we think they do.

Precision and Scale

Data types with decimal places are defined by what we call fixed precision and scale. Let’s look at an example:


In the above number, we see a six-digit number (ignoring the thousand separator) followed by a decimal point, and then a fraction represented by three decimal places. This number has a scale of 3 (the digits after the decimal point) and a precision of 9 (the digits for the entire value, on both sides of the decimal point). We would declare this value as DECIMAL(9,3).

This is confusing at first glance, because we have to declare it “backwards”, with the precision first, and then the scale. It may be easier to think of the precision in the same way we think of a character string’s length.

Date and time data types can also have decimal places, and SQL Server supports times accurate to the nearest 100 nanoseconds. The most accurate datetime is DATETIME2(7), where 7 decimal places are reserved for the time.

Before SQL Server 2008, we used DATETIME, which is only accurate to the nearest 3 milliseconds, and uses 8 bytes. A drop-in replacement for this is DATETIME2(3), using 3 decimal places, and accurate to the nearest millisecond. It only needs 7 bytes per column.

Be mindful that, as higher precision and scale are required, a column’s storage requirement increases. Accuracy is a trade-off with disk space and memory, so we may find ourselves using floating point values everywhere.

However, in cases where accuracy is required, always stick to exact numerics. Financial calculations, for example, should always use DECIMAL and MONEY data types.

Exact Numerics

Exact Numerics are exact, because any value that is stored is the exact same value that is retrieved later. These are the most common types found in a database, and INT is the most prevalent.

Exact numerics are split up into integers (BIGINT, INT, SMALLINT, TINYINT, BIT) and decimals (NUMERIC, DECIMAL, MONEY, SMALLMONEY). Decimals have decimal places (defined by precision and scale), while integers do not.

Integers have fixed sizes (see table below), so we don’t need to specify a length when declaring this data type.

Type Bytes Range
BIGINT 8 bytes -2^63 to 2^63-1
INT 4 bytes -2^31 to 2^31-1
SMALLINT 2 bytes -2^15 to 2^15-1
TINYINT 1 byte 0 to 255
BIT 1 bit 0 to 1
  • BIT is often used for storing Boolean values, where 1 = True and 0 = False.
  • Yes, BIGINT can store numbers as large as 2 to the power of 63 minus 1. That’s 19 digits wide, with a value of 9,223,372,036,854,775,807, or 9.2 quintillion.

Decimals may vary depending on the precision and scale, so we have to specify those in the declaration.

Type Bytes Range
DECIMAL 5 to 17 bytes Depends on precision and scale.
38 digits is the longest possible precision.
  • DECIMAL and NUMERIC are synonyms and can be used interchangeably. Read more about this data type, and how precision and scale affects bytes used, here.

Although the MONEY and SMALLMONEY data types do have decimal places, they don’t require the precision and scale in the declaration because these are actually synonyms for DECIMAL(19,4) and DECIMAL(10,4) respectively. Think of these data types for convenience more than anything.

Type Bytes Range
MONEY 8 bytes -922,337,203,685,477.5808 to 922,337,203,685,477.5807
SMALLMONEY 4 bytes -214,748.3648 to 214,748.3647

Approximate Numerics

Approximate Numerics mean that the value stored is only approximate. Floating point numbers would be classified as approximate numerics, and these comprise FLOAT and REAL.

Declaring a FLOAT requires a length, which represents the number of bits used to store the mantissa. REAL is a synonym of FLOAT(24).

The mantissa means the significant digits of a number in scientific notation, which is how floating point numbers are represented. The default is FLOAT(53). Generally, we stick to the defaults, and use REAL if we want to save space, forgoing some accuracy of the larger FLOAT(53).

Type Bytes Range
FLOAT 4 or 8 bytes -1.79E+308 to -2.23E-308, 0 (zero),
and 2.23E-308 to 1.79E+308
REAL 4 bytes -3.40E+38 to -1.18E-38, 0 (zero),
and 1.18E-38 to 3.40E+38

Date and Time

Date and time data types are slightly more complex. For storing dates (with no time), we use DATE. We store times (with no dates) using TIME. For storing both date and time in the same column, we can use DATETIME2, DATETIME, or SMALLDATETIME. Finally, we can even store timezone-aware values comprising a date and time and timezone offset, using DATETIMEOFFSET.

DATETIME2, TIME, and DATETIMEOFFSET take a length in their declarations, otherwise they default to 7 (accurate to the nearest 100 nanoseconds).

Character Strings

As we saw last week, characters can be fixed-length (CHAR) or variable-length (VARCHAR), and can support special Unicode character types (NCHAR and NVARCHAR respectively). Collation should also be taken into account.

Length can be 1 to 8000 for CHAR and VARCHAR, or 1 to 4000 for NCHAR and NVARCHAR. For storing values larger than that, see the Large Objects section below.

Binary Strings

Sometimes we want to store binary content in a database. This might be a JPEG image, a Word document, an SSL certificate file, or anything that could traditionally be saved on the file system. SQL Server provides the BINARY and VARBINARY data types for this (and IMAGE for backward compatibility).

Length can be 1 to 8000 for BINARY and VARBINARY. For storing values larger than that, see the Large Object section below.

Large Objects

SQL Server 2008 introduced a new MAX length for several data types, including CHAR, NCHAR, VARCHAR, NVARCHAR, BINARY and VARBINARY.

(The XML data type uses MAX under the covers as well.)

This new specification allows up to 2 GB of data to be stored in a column with that declared length. We should take care not to use 2 GB when inserting data into these columns, but it provides greater flexibility when inserting more than 8000 bytes into one of these columns.

Other Data Types

SQL Server supports other types of data, which fall outside the scope of text and numerics. These include CURSOR, TABLE, XML, UNIQUEIDENTIFIER, TIMESTAMP (not to be confused with the date and time types), HIERARCHYID, SQL_VARIANT, and Spatial Types (GEOGRAPHY and GEOMETRY).

Next week, we will see how normalization and data types work together, now that we have a good overview of the different data types in a database.

If you have any thoughts or comments, please find me on Twitter at @bornsql.

Data Types and Collation

Last week we started with a very simple definition of a database: a discrete set of information, with a specific structure and order to it.

We briefly looked at normalization, which is a way to store as little of the information as possible, so that it stays unique.

We will cover more normalization as we move forward through this series, but first we will talk about how the information, or data, is stored. (This does affect normalization and relationships, even if that is not immediately clear.)

For this week’s discussion, we need to consider a spreadsheet application, like Microsoft Excel or Google Sheets.

Columns and Rows

In a spreadsheet, we have columns and rows. Usually we will also have a header row, so we can distinguish between each column.

Each column, in turn, may be formatted a certain way so that we can easily see what kind of information is in that column. For instance, we may want to format a column as a currency, with leading symbol and the two decimal places at the end.

We may left-align text values, and we may decide that numbers have no decimal places and are right-aligned. Dates and times get their own formatting.

If we were to compare this structure to that of a database, we can imagine that each sheet is a table, and each column and row is a column and row in the table.

In some applications like Microsoft Access, we may hear different terminology for columns and rows, namely fields and records. However, in SQL Server, we maintain the same convention as Excel and call them columns and rows.

Because SQL Server doesn’t care about how our data looks, we have to specify those traits when we create the table. Whether creating from scratch or from an import process through an external application (Excel, for example), we need to specify the data type for each column.

There are several key reasons why we want to do this.

In the case of numbers that will be summarized in some way (sum, average, minimum, maximum, mean, mode), we want SQL Server’s database engine to treat these as numbers internally so that it doesn’t have to convert anything, which in turn makes the calculations much faster.

The same goes for dates, times, and datetimes (where both the date and time is in one column) because the database engine understands date and time calculations, provided the data types are correct.

Text values are also very important but for a fundamentally different reason. While computers understand numbers, it’s humans that understand text.

We will focus the rest of this week’s discussion on storing strings in a database.


Imagine we are developing a database for international customers, and we need to support accented characters or an entirely different alphabet. Database systems use a catch-all term for this, and that is collation.

When we install SQL Server, we are asked what the “default” is, then we are presented with some arcane terminology which may be confusing, so we leave the defaults and click Next.

Collation has to do with how data is sorted, and thus the order in which we see it when data is returned.

Note that collation only affects text columns.

The Windows regional settings, for the user installing SQL Server, will affect the default collation of a SQL Server installation. If we were to install SQL Server on a machine that is configured with U.S. regional settings, it will have a very different default collation than a server that is set for Canada or Finland.

The default SQL Server collation for US regional settings (SQL_Latin1_General_CP1) may need to be changed to match what is required for the user databases that will be running on a server.

The above values mean the following:

  • General – the sort order follows 0-9, A-Z;
  • CP1 – code-page 1, the US English default;
  • Case Insensitivity and Accent Sensitivity are implied (see below).

When not using US English, or the Latin alphabet, we need to be aware that the data’s sort order is taken into account.

Even more confusingly, some vendor products require a specific collation for their database. For example, Microsoft’s own SharePoint database uses the collation Latin1_General_CI_AS_KS_WS:

  • CICase Insensitive – no difference between upper and lower case when sorting data;
  • ASAccent Sensitive – distinguishes between accented characters, for instance, the Afrikaans words “sê” and “se” are considered different;
  • KSKana Sensitive – distinguishes between different Japanese character sets;
  • WSWidth Sensitive – distinguishes between characters that can be expressed by both single- or double-byte characters.

(Read up more about collation options here.)

Text Data Types

Now that we have a very basic grasp of collation, let’s look at text data types.

We tend to use only four text data types in SQL Server these days:

CHAR(n), NCHAR(n), VARCHAR(n), and NVARCHAR(n), where n may be a number between 1 and 8,000 or the keyword MAX.

Why 8,000?

For historic reasons, SQL Server set their data page size (the amount of storage available on each data page, including headers and footers) to 8KB many years ago. This means that the largest amount of data we can store on a single page is 8,192 bytes. Once we take away the header and the slot array at the end, we are left with slightly more than 8,000 bytes for our data.

When we store a text value, we need to decide if the characters can be expressed in a single byte or as double-byte characters (also known as Unicode, using two bytes per character). Alphabets like Kanji, Chinese (Simplified or Traditional), and Turkish, will require double-byte characters, for each character in their alphabet.

(Some code pages need more than two bytes for a character. That is outside of the scope of this discussion.)

So CHAR or VARCHAR uses one byte per character, while NCHAR and NVARCHAR uses two bytes per character (the N represents Unicode).

Thus, the longest a CHAR or VARCHAR string can be is 8000, while the longest an NCHAR or NVARCHAR string can be is 4000 (at two bytes per character).

MAX Data Type

In SQL Server 2008, several new data types were introduced, including the MAX data type for strings and binary data. The underlying storage mechanism was changed to allow columns longer than 8,000 bytes, where these would be stored in another section of the database file under certain conditions.

The MAX data type allows up to 2 GB (more than two billion bytes) for every row that column is used.

So we have to consider three distinct things when deciding how we store text: collation, Unicode, and string length.

Because my readers are terribly intelligent, you’ve already deduced that the VAR in VARCHAR means “variable length”, and you’d be correct.

We use VARCHAR (and its Unicode equivalent NVARCHAR) for columns that will contain strings with variable lengths, including names, addresses, phone numbers, product names, etc. In fact, along with INT (meaning a 4-byte integer), VARCHAR is probably the most common data type in any database today.

CHAR (and NCHAR), on the other hand, are fixed-length data types. We use this type for string lengths that are unlikely to change. For example, IMEI numbers, Vehicle Identification Numbers, social security numbers (where the dash forms part of the number), product codes, serial numbers with leading zeroes, and so on. The point here is that the length is fixed.

So why don’t we just use VARCHAR instead of CHAR everywhere?

Let’s start with why VARCHAR was introduced in the first place, and why we would use it instead of CHAR.

For columns with unpredictably long strings, we don’t want to reserve all 8,000 bytes per row for a string that may only take up 2,000 bytes—and end up wasting 6,000 (not to mention the storage required for a MAX column)—so we switch to VARCHAR, and each row only uses as many bytes as it needs.

However, SQL Server needs to keep track of the length of a VARCHAR column for each row in a table. There is a small overhead of a few bytes per row for every VARCHAR for SQL Server to keep track of this length. The reason we don’t replace CHAR and NCHAR outright, is ironically to save space.

It doesn’t make sense for a table containing millions or billions of rows to use VARCHAR for fixed-length columns because we would be adding on another few bytes per row as unnecessary overhead. Adding just one byte per million rows is roughly 1 MB of storage.

Extrapolating that extra byte to the memory required to hold it, maintenance plans when updating indexes and statistics, backups, replicated databases, and so on, we are now looking at extra megabytes, and possibly gigabytes, for the sake of convenience.

We must make sure that we pick the correct character type for storing strings, beyond just the length of the string. Both CHAR and VARCHAR have their place.

While we did spend most of this discussion on collations and text, we’ve only scratched the surface.

Next week, we will discuss how to pick the right data type for your columns, with concrete examples. This matters a lot with how numbers are stored.

If you have any feedback, find me on Twitter at @bornsql.

What is a Database?

What is Microsoft SQL Server?

To answer that, we have to ask what a relational database management server (RDBMS) is.

To answer that, we have to ask what a relational database is.

To answer that, we have to understand what the relational model is, what a database is, and how these two concepts combine to form what is effectively the basis of most technology today: the database.


A database is, fundamentally, a collection of information that is stored in a defined structure.

A telephone book is a database. A recipe book is a database. Each of these books has a set of rules that define how the information is stored and retrieved. That is the structure.

Database Query

When we want to retrieve information, we query the structure with language appropriate to the database. Tell me the phone number of Randolph West, by looking up the surnames and going through all the Ws. Find the recipe for lemon meringue by going through the desserts and searching for meringue.

In a RDBMS, the language is called Structured Query Language, or SQL. You can pronounce it like “sequel”, or say each letter.

Relational Model

A few decades ago, IBM employee and computer scientist Edgar Codd developed his “Twelve Rules”, dictating how data should be laid out in a relational manner, using first-order predicate logic.

There’s no really easy way to explain what this exactly means at a philosophical or mathematical level, especially not in this forum, so I will explain what makes a relational database instead.

Imagine you want to buy a new iPhone. You walk into the store, and find a Genius. Her name tag says Thandi, and she takes you to the desk where the various models are displayed.

You decide after a few minutes that you want to get the glossy black one with the big screen and lots of storage, because in this imaginary scenario, you have lots of disposable income.

You also select Apple Care on the phone, and for a little bonus, you get yourself a blue leather phone cover.

Out comes the credit card, the transaction is approved, and Thandi and you exchange Twitter handles and decide to be friends. Everyone is happy.

Here’s what happens from the relational perspective, in a high level overview:

Stock Item
Product iPhone
Model iPhone 7 Plus
Storage Capacity 256GB
Colour Black
Style Glossy
Carrier Unlocked
Serial Number xxx
Apple Care Yes
Purchase Date 2017-03-08
Price $1000
Stock Item
Product iPhone cover
Model iPhone 7 Plus
Colour Blue
Style Leather
Purchase Date 2017-03-09
Price $50
Store name Cross Iron Mills, Calgary
Salesperson [email protected]
Date 2017-03-08
Customer [email protected]
Payment method Credit Card

According to Codd’s paper on relational theory (PDF), these items should be defined by their natural structure, and that there should be one way, and only one way, to uniquely identify each item in that purchase event, from you and Thandi, to the phone cover and Apple Care, and how they relate to each other to create a single, unique transaction.

This is called normalization. The transaction can only be recorded once, but in a way that includes all the different information.

Although Apple has sold billions of phones, there is only one product category in their stock called iPhone. It would be reasonable to assume that their database contains a table called Products, and possible values include Mac, iPhone, iPad and iPod. Models would be in their own unique list.

In the same vein, there is a single list of possible colours that these products can be: white, black, silver, gold, red, blue, rose gold. These would go into a database table called Colours.

There are only a few storage capacities they can be: 32 GB, 128 GB, 256 GB. There’s the Storage table.

There would also be a table of Stores, a table of Staff, and a table of Customers. Each possible item would appear only once in these tables.

To distinguish between each unique value in the table, each record, or row, contains a unique identifier, or what we call a Primary Key, or PK for short (sometimes called an identity column, but this is not always the case). This is usually a number, because computers process numbers very efficiently, but can be any data type (we will cover data types in the next post).

For example, in the Colours table, the primary key might be 2 for black. For Products, the phone’s identifier might be 19, because Apple created several product categories before the iPhone was invented. The Storage primary key might be 5. Thandi might have a Staff PK of 322,544. Your Customer PK value might be 477,211,549.

All this information so far doesn’t tell us anything about the transaction itself. To do that, we have to associate each of these Primary Keys from each table, into a single row in a Transactions table, where we record the date and time, along with the sale amount and GST.

This association is called a relationship, and this is where we get the notion of relational data. The data in the Transactions table only has meaning, when it relates to the other tables.

In other words, all of these elements can be used to uniquely identify the sale of the phone and the phone cover to you, provided they all link back to a unique transaction PK.

When we use the values of those primary keys in a table to refer back to the original table, we call them Foreign Keys (FKs). This is because the Transactions table would have its own PK, as mentioned above, which in our example might be named the TransactionID column.

Next week, we will dive slightly deeper into data types and collation. Stay tuned.

If you have any feedback, please let me know on Twitter at @bornsql.

Max Server Memory and SQL Server 2016 Service Pack 1

Everything changed for SQL Server Standard Edition on 16 November 2016, and how memory limits work.

On that day, a slew of Enterprise Edition features made their way into editions across the board, including Express Edition and LocalDB.

The memory limit of 128GB RAM applies only to the buffer pool (the 8KB data pages that are read from disk into memory — in other words, the database itself).

For servers containing more than 128GB of physical RAM, and running SQL Server 2016 with Service Pack 1 or higher, we now have options.

The max server memory setting always did only refer to the buffer pool, but for many reasons there was misunderstanding from a lot of people that it included other caches as well.

Because ColumnStore and In-Memory OLTP have their own cache limits over and above the 128GB buffer pool limit, the guidance around assigning max server memory is no longer simple.

ColumnStore now gets an extra 32GB of RAM per instance, while In-Memory OLTP gets an extra 32GB of RAM per database.

With that in mind, you are still welcome to use the Max Server Memory Matrix and associated calculator script for lower versions of SQL Server (up to and including 2014), but I will not be maintaining it further, unless someone finds a bug.

How much should I assign to max server memory? It depends.

It would be very easy to spec a server with 256GB RAM, install a single instance of SQL Server 2016 Standard Edition (with Service Pack 1, of course), have 128GB for the buffer pool, 32GB for the ColumnStore cache, three databases with 32GB of RAM each for In-Memory OLTP, and still run out of memory.

This is a brave new world of memory management. Tread carefully.

If you’d like to share your thoughts, find me on Twitter at @bornsql.

SQL Modernization and Cloud Migration Talk in Calgary

On 28 February 2017, I will be speaking at Microsoft Calgary as part of a roadshow for AMTRA Solutions. Lunch will be provided.

You can reserve your place for my session on Eventbrite, and there is no charge. The event starts at 9am.

The venue is at 110 9th Avenue SW (Suite 800), Calgary AB, T2P 0T1.

I will be discussing some of the new features of SQL Server and Azure SQL Database, in the context of modernizing your data platform infrastructure,  and there will be live demos.

After lunch, fellow MVP Kevin Kaminski will be discussing Windows 10 and its place in a modern and secure network. I’ll be staying to watch that session, which you can book for here.

Please stop by!

Testing for Object Existence: CREATE OR ALTER

For the longest time, T-SQL writers have had to wrestle with ways of testing for an object’s existence so that it can either be dropped and recreated, or modified as needed.

Last week we covered the new DROP ... IF EXISTS syntax. This week goes into how we handle changes to objects.

We’ve spent many hours of our lives fighting with an object existence check so that ALTER commands don’t fail.

The common implementation pattern now is to CREATE a dummy object with the appropriate name and then use ALTER to write the actual code. That way, future changes can just be done by using the ALTER keyword.

We can see this in a famous example: sp_WhoIsActive, an extremely popular stored procedure written by Adam Machanic, which I highly recommend installing when setting up a new SQL Server instance.

USE master;

WHERE ROUTINE_NAME = 'sp_WhoIsActive')
EXEC ('CREATE PROC dbo.sp_WhoIsActive AS
SELECT ''stub version, to be replaced''');
ALTER PROC dbo.sp_WhoIsActive (...);

Look at all that unnecessary code. It’s messy; it’s prone to errors. There are so many ways to do it, which makes it inconsistent, meaning that automatically searching through a code base isn’t reliable. Additionally, if we miss a single quotation mark, the entire script fails.

In a word: Ugh!

What happens if we forget to run the stub first? What happens if we have an existing object and run the CREATE accidentally, then we have to manually change it to an ALTER

(Side note: I spent a good few minutes one day a few years ago not remembering the ALTER keyword.)

SQL Server 2016 Service Pack 1 has finally added in a feature that many DBAs and database developers have been clamouring for: CREATE OR ALTER.

In the above example, the entire IF NOT EXISTS section can be replaced with:

USE master;

CREATE OR ALTER PROC dbo.sp_WhoIsActive (...);

While this has been a long time coming, causing many anguished cries from people writing T-SQL scripts over the years, we are going to love using this new, small, yet significant, syntax.

Share your best ALTER story with me on Twitter, at @bornsql.

Testing for Object Existence: DROP … IF EXISTS

For the longest time, T-SQL writers have had to wrestle with ways of testing for an object’s existence so that it can either be dropped and recreated, or modified as needed.

This is especially common in the case of temp tables or table variables. If the object already exists when our script runs, the script will fail and leave our workflow in an inconsistent state.

Consider this script:

IF OBJECT_ID('tempdb..#doTheThing') IS NOT NULL
DROP TABLE #doTheThing;

CREATE TABLE #doTheThing (
ImportantColumn1 BIGINT,
ImportantColumn2 TINYINT,
ImportantColumn3 NVARCHAR(255)

With SQL Server 2016, we can now do this:


CREATE TABLE #doTheThing (
ImportantColumn1 BIGINT,
ImportantColumn2 TINYINT,
ImportantColumn3 NVARCHAR(255)


Share your object existence check nightmares with me on Twitter at @bornsql.

Configuration Manager Shortcut on Windows 10 and Server 2016

This is more for my own reference than anything.

On newer versions of Windows desktop and Windows Server, we may find that the shortcut to SQL Server Configuration Manager is missing.

According to this MSDN article, the reason for this change is that Configuration Manager is a Management Console snap-in:

Because SQL Server Configuration Manager is a snap-in for the Microsoft Management Console program and not a stand-alone program, SQL Server Configuration Manager does not appear as an application in newer versions of Windows.

I think this is ridiculous because it does not maintain backward compatibility.

This is especially frustrating because the same article reminds us that all changes to SQL Server services should be managed through the Configuration Manager.

The workaround is to create our own shortcut as follows:

SQL Server Version Path for Shortcut
SQL Server 2008 / R2 C:\Windows\SysWOW64\SQLServerManager10.msc
SQL Server 2012 C:\Windows\SysWOW64\SQLServerManager11.msc
SQL Server 2014 C:\Windows\SysWOW64\SQLServerManager12.msc
SQL Server 2016 C:\Windows\SysWOW64\SQLServerManager13.msc

Share your frustrations with the “modern” Windows UI with me on Twitter at @bornsql.

Temporal Tables and Hidden Period Columns

In my November 2015 post, An Introduction to Temporal Tables in SQL Server 2016 using a DeLorean, I wrote:

The HIDDEN property is optional and will hide these columns from a standard SELECT statement for backward compatibility with our application and queries. You cannot apply the HIDDEN property to an existing column.

It turns out that this is no longer true. You can apply the HIDDEN property to an existing period column.

Let’s assume you have a temporal table containing two visible period columns, StartDate and EndDate, which you’d like to hide from a typical SELECT statement.

Using an ALTER TABLE ... ALTER COLUMN statement, simply place the ADD HIDDEN syntax after the period column name(s).


You can also remove this flag if you wish, using DROP HIDDEN:


This is a great improvement to an already fantastic feature of SQL Server 2016. Thanks to Borko Novakovic for this tip.

If you have any more temporal table tricks you want to share, find me on Twitter at @bornsql.

My surname is NULL

Last Wednesday on Twitter, Abayomi Obawomiye (@SQLAmerica) wrote:

I just met someone with the last name NULL today. I really want to ask if they had issues with the last name but worried might be offensive

Abayomi goes on further to say that the word “NULL” gets replaced with a space in any CSV files generated for reports, which got me thinking about this problem.

Then I realised I had already written about it. We make assumptions on a daily basis when it comes to working with data. In this case, Abayomi assumed that no one’s name could ever be “Null”.

Yes, I’ve made that assumption too.

This sparked a memory in the deep and rusty recesses of my brain. I remember reading an article about someone whose last name is Null. It took me just a few seconds to find the article on Wired Magazine’s website, and the pullout phrase is right in the headline:

Hello, I’m Mr. Null. My Name Makes Me Invisible to Computers

It turns out that Abayomi is not alone with this chance meeting. How many of us consider the possibility that legitimate data in our databases could match reserved keywords?

The author of that article, Christopher Null, has an amusing workaround:

My usual trick is to simply add a period to my name: “Null.” This not only gets around many “null” error blocks, it also adds a sense of finality to my birthright.

In my goal to build systems that are easier to use, that should adapt to the consumer, this is clearly problematic. We shouldn’t be forcing Christopher to change his own name to fit our database model.

How would I go about dealing with this problem? Clearly when exporting data to CSV or generating extracts for other other third-party use, this problem has to be solved.

Delimiters are Good

The easiest way I can think of is for data types to be delimited correctly. When exporting to CSV format, we would ensure that all strings be surrounded by quotation marks. This way, Christopher Null’s record would show up as follows (assuming last name, first name ordering):


Then the parsing process on the reporting side would import that into whichever system without stumbling over the value.

Another issue raised by Christopher in his Wired article is that his email address is rejected outright by Bank of America because it contains the word null before the @ sign.

First and Last Name columns are bad

I’m going to call myself out on the above example and say that assuming names can fit neatly into two columns is bad. Firstly, Cher and Madonna would complain.

Secondly, Vice Admiral Horatio Nelson, 1st Viscount Nelson, 1st Duke of Bronté, would run out of space before his name could be entered correctly.

This doesn’t even begin to address names that are outside of our mainly Western perspective. In South Africa, I saw names that few developers in North America would consider, making use of all manner of punctuation (including spaces).

My recommendation here is to have a 250-character NVARCHAR column called FullName and leave it at that. If you really want your mail merge software to send “personalised” emails or letters, add an additional PreferredName column, and make sure it’s properly delimited when exporting.

Christopher Null wouldn’t have to bat an eyelid here. It would solve for his problem.

(While I’m on this matter, don’t ask for Gender or Sex. You don’t need it. If you do, you’ll know how to ask properly.)

Email Validation is Stupid

My third rule for ensuring that Nulls can exist in this world with Wests and Obawomiyes is to stop validating email addresses!

Firstly, most developers use a regular expression to do it, which is wrong and terrible and slow, and (clearly) imperfect.

This is a particular bugbear of mine, since I have one character before the @ in my personal email address. Many systems fail when trying to validate it, including WestJet. My own workaround is to create a westjet@ alias so that I can receive email from WestJet.

The best way to validate an email address is to send an email to that address with a link for someone to click on. If they don’t click on the link to close the loop, the email address isn’t valid, and your well-designed system won’t allow them to continue to the next step unless that link is clicked.


Three simple recommendations could help the Nulls find visibility again and ensure data inserted into a system is not modified unnecessarily.

For more reading on this and similar problems, I recommend visiting Bobby Tables.

If you have any experience in this area, please feel free to contact me on Twitter, at @bornsql .