Data Types and Collation

Last week we started with a very simple definition of a database: a discrete set of information, with a specific structure and order to it.

We briefly looked at normalization, which is a way to store as little of the information as possible, so that it stays unique.

We will cover more normalization as we move forward through this series, but first we will talk about how the information, or data, is stored. (This does affect normalization and relationships, even if that is not immediately clear.)

For this week’s discussion, we need to consider a spreadsheet application, like Microsoft Excel or Google Sheets.

Columns and Rows

In a spreadsheet, we have columns and rows. Usually we will also have a header row, so we can distinguish between each column.

Each column, in turn, may be formatted a certain way so that we can easily see what kind of information is in that column. For instance, we may want to format a column as a currency, with leading symbol and the two decimal places at the end.

We may left-align text values, and we may decide that numbers have no decimal places and are right-aligned. Dates and times get their own formatting.

If we were to compare this structure to that of a database, we can imagine that each sheet is a table, and each column and row is a column and row in the table.

In some applications like Microsoft Access, we may hear different terminology for columns and rows, namely fields and records. However, in SQL Server, we maintain the same convention as Excel and call them columns and rows.

Because SQL Server doesn’t care about how our data looks, we have to specify those traits when we create the table. Whether creating from scratch or from an import process through an external application (Excel, for example), we need to specify the data type for each column.

There are several key reasons why we want to do this.

In the case of numbers that will be summarized in some way (sum, average, minimum, maximum, mean, mode), we want SQL Server’s database engine to treat these as numbers internally so that it doesn’t have to convert anything, which in turn makes the calculations much faster.

The same goes for dates, times, and datetimes (where both the date and time is in one column) because the database engine understands date and time calculations, provided the data types are correct.

Text values are also very important but for a fundamentally different reason. While computers understand numbers, it’s humans that understand text.

We will focus the rest of this week’s discussion on storing strings in a database.

Collation

Imagine we are developing a database for international customers, and we need to support accented characters or an entirely different alphabet. Database systems use a catch-all term for this, and that is collation.

When we install SQL Server, we are asked what the “default” is, then we are presented with some arcane terminology which may be confusing, so we leave the defaults and click Next.

Collation has to do with how data is sorted, and thus the order in which we see it when data is returned.

Note that collation only affects text columns.

The Windows regional settings, for the user installing SQL Server, will affect the default collation of a SQL Server installation. If we were to install SQL Server on a machine that is configured with U.S. regional settings, it will have a very different default collation than a server that is set for Canada or Finland.

The default SQL Server collation for US regional settings (SQL_Latin1_General_CP1) may need to be changed to match what is required for the user databases that will be running on a server.

The above values mean the following:

General – the sort order follows 0-9, A-Z;
CP1 – code-page 1, the US English default;
Case Insensitivity and Accent Sensitivity are implied (see below).

When not using US English, or the Latin alphabet, we need to be aware that the data’s sort order is taken into account.

Even more confusingly, some vendor products require a specific collation for their database. For example, Microsoft’s own SharePoint database uses the collation Latin1_General_CI_AS_KS_WS:

CI – Case Insensitive – no difference between upper and lower case when sorting data;
AS – Accent Sensitive – distinguishes between accented characters, for instance, the Afrikaans words “sê” and “se” are considered different;
KS – Kana Sensitive – distinguishes between different Japanese character sets;
WS – Width Sensitive – distinguishes between characters that can be expressed by both single- or double-byte characters.

(Read up more about collation options here.)

Text Data Types

Now that we have a very basic grasp of collation, let’s look at text data types.

We tend to use only four text data types in SQL Server these days:

CHAR(n), NCHAR(n), VARCHAR(n), and NVARCHAR(n), where n may be a number between 1 and 8,000 or the keyword MAX.

Why 8,000?

For historic reasons, SQL Server set their data page size (the amount of storage available on each data page, including headers and footers) to 8KB many years ago. This means that the largest amount of data we can store on a single page is 8,192 bytes. Once we take away the header and the slot array at the end, we are left with slightly more than 8,000 bytes for our data.

When we store a text value, we need to decide if the characters can be expressed in a single byte or as double-byte characters (also known as Unicode, using two bytes per character). Alphabets like Kanji, Chinese (Simplified or Traditional), and Turkish, will require double-byte characters, for each character in their alphabet.

(Some code pages need more than two bytes for a character. That is outside of the scope of this discussion.)

So CHAR or VARCHAR uses one byte per character, while NCHAR and NVARCHAR uses two bytes per character (the N represents Unicode).

Thus, the longest a CHAR or VARCHAR string can be is 8000, while the longest an NCHAR or NVARCHAR string can be is 4000 (at two bytes per character).

MAX Data Type

In SQL Server 2008, several new data types were introduced, including the MAX data type for strings and binary data. The underlying storage mechanism was changed to allow columns longer than 8,000 bytes, where these would be stored in another section of the database file under certain conditions.

The MAX data type allows up to 2 GB (more than two billion bytes) for every row that column is used.

So we have to consider three distinct things when deciding how we store text: collation, Unicode, and string length.

Because my readers are terribly intelligent, you’ve already deduced that the VAR in VARCHAR means “variable length”, and you’d be correct.

We use VARCHAR (and its Unicode equivalent NVARCHAR) for columns that will contain strings with variable lengths, including names, addresses, phone numbers, product names, etc. In fact, along with INT (meaning a 4-byte integer), VARCHAR is probably the most common data type in any database today.

CHAR (and NCHAR), on the other hand, are fixed-length data types. We use this type for string lengths that are unlikely to change. For example, IMEI numbers, Vehicle Identification Numbers, social security numbers (where the dash forms part of the number), product codes, serial numbers with leading zeroes, and so on. The point here is that the length is fixed.

So why don’t we just use VARCHAR instead of CHAR everywhere?

Let’s start with why VARCHAR was introduced in the first place, and why we would use it instead of CHAR.

For columns with unpredictably long strings, we don’t want to reserve all 8,000 bytes per row for a string that may only take up 2,000 bytes—and end up wasting 6,000 (not to mention the storage required for a MAX column)—so we switch to VARCHAR, and each row only uses as many bytes as it needs.

However, SQL Server needs to keep track of the length of a VARCHAR column for each row in a table. There is a small overhead of a few bytes per row for every VARCHAR for SQL Server to keep track of this length. The reason we don’t replace CHAR and NCHAR outright, is ironically to save space.

It doesn’t make sense for a table containing millions or billions of rows to use VARCHAR for fixed-length columns because we would be adding on another few bytes per row as unnecessary overhead. Adding just one byte per million rows is roughly 1 MB of storage.

Extrapolating that extra byte to the memory required to hold it, maintenance plans when updating indexes and statistics, backups, replicated databases, and so on, we are now looking at extra megabytes, and possibly gigabytes, for the sake of convenience.

We must make sure that we pick the correct character type for storing strings, beyond just the length of the string. Both CHAR and VARCHAR have their place.

While we did spend most of this discussion on collations and text, we’ve only scratched the surface.

Next week, we will discuss how to pick the right data type for your columns, with concrete examples. This matters a lot with how numbers are stored.

If you have any feedback, find me on Twitter at @bornsql.