Why You Need A Test Environment

In 2017, there’s no excuse not to have at least a testing environment, and preferably a development environment as well, in order to create and deploy code changes (software, database, you name it) outside of a production environment. If you’re making changes in a production environment without testing them first, you’re doing it wrong.

Some companies I’ve worked with have different forms of testing environments, including QA (Quality Assurance), IAT (Internal Acceptance Testing), and UAT (User Acceptance Testing). What they are called doesn’t matter, so long as they exist.

In a typical IT deployment, whether using Waterfall, Agile, or other development methodologies of the month, it pays to have a basic development → test → production deployment path.

Development

This environment is a general playground for building things: writing code, installing weird and wonderful third-party plugins, and getting the job done as far as producing new things is concerned.

In the development environment (which might be a shared machine, or a bunch of sticker-encrusted laptops used by developers), there should be something called Source Control.

Entire books have been written about source control, but the premise is that the development environment is a cesspit of trouble, so any code that is written and compiles (without breaking the build) should be checked into source control as soon as possible.

This serves three purposes:

1) Other developers can access the new code without having to plug in a USB drive and risk malware infection on Steve’s laptop (sorry, Steve).

2) If Steve’s laptop is stolen, or he spills coffee on the keyboard, or it gets infected by malware because he doesn’t keep the operating system up to date, the source code is checked in safely somewhere else, and the team doesn’t lose what could amount to days of productivity.

3) As soon as code is checked into source control, an automated process known as Continuous Integration (CI) can compile the code, deploy it somewhere, and run a battery of tests against it without human intervention. CI is much faster than manual testing.

It’s imperative to keep in mind that all software contains bugs. We may not have a Steve in our company, but developers always make assumptions—they have to; it’s part of the job—so little things can escalate into big things if the software is not used as intended.

Test

The test environment should be set up exactly like production, down to hardware and operating system configuration. The only differences should be application configuration that might point to a test vs. a production database server, for instance.

Why is it so important to make sure that test and production look and behave the same?

In the case of SQL Server for example, a simple thing like data distribution in a table, along with the amount of RAM or CPU cores, will cause a specific query plan to be selected. If the same query is deployed to production with a different underlying hardware configuration or data distribution, a worse query plan might be selected by the query optimizer, and the developers are going to have to work nights and weekends to fix it.

Whose fault is it, though? In this particular case, it’s the fault of whoever set up the test environment, because it doesn’t mirror production.

When I mentioned the different types of test environments above, this is what they are for:

  • Internal Acceptance Testing – the developers and possibly the product owners have tested the code and it works like the functional specification says it should.
  • User Acceptance Testing – the target audience (users) get their hands on the code and do the things that you never expected them to, and often will go off script or off spec. I worked with a guy called Gordon who I knew would break something. Make your code Gordon-proof.
  • Quality Assurance – the code is tested by a dedicated QA team against the functional specification, as well as a set of company guidelines that ensure the software meets certain standards, without the assumptions that developers inevitably make. Think about Microsoft products: different teams write Word and Excel, but they have to work in similar ways, with a high level of confidence that data won’t go missing or get corrupted.

Usually, there’s a bit of a cycle in the process here, where if something fails in test, it has to be fixed in development and redeployed, until it passes muster.

Production

Once code has been signed off by the end users and QA, it can be deployed into production. This can be done in many ways, and unfortunately one of those ways is that Steve has a USB drive containing files that he transferred from his coffee-and-malware-infested laptop.

Deployment to production should be the exact same process as deployment to test. Why? Because if it is not, then assumptions are being made and things can go wrong.

Remember source control? The best deployments are automated deployments. They are scripted, which makes them repeatable and therefore predictable. An automated process should be able to grab code from source control, compile it, and deploy it, the same way every time.

Deployment

Most companies will have a maintenance window during which deployments can take place. These are usually outside of working hours and on a weekend, but in my experience, weekends are a terrible time to deploy new code, so I prefer it to happen on a Wednesday night. That way, Steve is most likely in the office on Thursday morning to fix his bad code.

Source control should be the single version of the truth. If a developer is deploying from a thumb drive, stop doing that right now. There are free and inexpensive solutions available that can automate deployments for you, and they require only a small amount of effort to achieve repeatable success.

Summary

Source Control: Get it now. If you want to use an online provider (which I recommend), check out GitHub (mostly free), Visual Studio Team Services (free for up to five developers), or BitBucket (mostly free). There are probably hundreds more.

Continuous Integration: Check it out. Visual Studio Team Services has it built in, even with the free version. Remember, the whole point of this is automated builds and repeatable results.

If you would like to discuss this with me further, you can find me on Twitter at @bornsql.

How to write an UPDATE query

My First UPDATE Statement

Last week we covered how to put information into a table using an INSERT statement.

This week we will learn how to make changes to data that is already in a table using an UPDATE statement. We are also going to learn all about why the WHERE clause is so important.

UPDATE

Updating information stored in a database can be as simple as identifying which column to update and specifying the new data to write in that column:

UPDATE [dbo].[Stores]
SET [StoreName] = N'West Edmonton Mall';
GO

The problem with this particular UPDATE statement, though, is that it will update every row, when we may only want to change one record. That’s why we need to add WHERE. The purpose of a WHERE clause is to restrict the data modification using a filter or limit.

The WHERE clause

In ANSI SQL (remember, ANSI is the standards organisation that defines how SQL works with different database platforms), a WHERE clause goes somewhere near the end of a query, to filter or limit the data being affected.

The WHERE keyword can be used with all four basic data manipulation queries: directly on SELECT, UPDATE and DELETE, and on INSERT when it is combined with a SELECT. The reason we’re only learning about it today is that it is a critical part of the UPDATE and DELETE statements.

Limiting the damage

An UPDATE statement without a WHERE condition, or clause, will update the entire table.

Consider these two examples, and imagine what is happening in the table:

-- Without a WHERE clause
UPDATE [dbo].[Stores]
SET [StoreName] = N'West Edmonton Mall';
GO

-- With a WHERE clause
UPDATE [dbo].[Stores]
SET [StoreName] = N'West Edmonton Mall'
WHERE [StoreID] = 2;
GO

In the first example, every single row in our table will now have ‘West Edmonton Mall’ for the store name. In the second example, only the row (or rows) that match the condition in the WHERE clause will be updated.

Notice how the WHERE condition relates to a value in the [StoreID] column, which is this table’s Primary Key. When updating individual rows in a table, it’s always better to use a unique value to guarantee that we only update a single row.

We could use a WHERE condition on a different column, or combinations of columns, using AND and OR logic statements. WHERE clauses can be extremely complex.

Note: the UPDATE statement will update whatever we tell it to. Even though the original value of the column was ‘West Edmonton Mall’, the database engine will happily update that value again and again, to the same string, if we ask it to.

SET

Astute readers will recognise a new keyword in the UPDATE statement: SET.

The first part of an UPDATE specifies the table we wish to modify.

Then, the SET keyword specifies the column we want to change. We can add more than one column to our SET clause, separated by commas, if we have more than one column in the same row (or rows) that we wish to update.

For instance, let’s assume the West Edmonton Mall store has a new manager. We can modify the [ManagerName] and [ManagerEmail] columns at the same time.

UPDATE [dbo].[Stores]
SET [ManagerName] = N'Wesley Wensleydale',
[ManagerEmail] = N'wesley@example.com'
WHERE [StoreID] = 2;
GO

This kind of operation, where more than one thing (one or more columns, across one or more rows) is changed at the same time, is called a set-based operation. In other words, a set of data is updated at once, instead of writing a separate UPDATE statement for every single change.

Set-based operations can run on the entire table, or on a filtered set of data specified by the WHERE clause.
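
For example, here is a small sketch of an UPDATE that touches several rows at once. The manager names are made up, and the column names are assumed from our earlier examples:

-- Every store currently managed by Bob Bobson gets a new manager, in one statement
UPDATE [dbo].[Stores]
SET [ManagerName] = N'Thandi Funaki',
[ManagerEmail] = N'thandi@example.com'
WHERE [ManagerName] = N'Bob Bobson';
GO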

Ambiguity Verboten

Unlike the INSERT statement, where column names were implied, an UPDATE statement has to explicitly list the column (or columns) that we are updating. In our examples above, we had [StoreName], [ManagerName] and [ManagerEmail]. This is because there can be absolutely no ambiguity when modifying data.

This is also why a WHERE clause is so important. I have personally run an UPDATE or DELETE without a WHERE clause, and it happens a lot in this field. Get into the habit of writing the WHERE clause before filling in the rest of the statement.

Some tools that plug into SQL Server Management Studio will detect if we have missed a WHERE clause on an UPDATE or DELETE, but we can’t always rely on plugins. For example, we might have to use a different computer one day, or write our SQL code in a text editor like Notepad, and only good habits can avoid disaster.

The Final Results

If you’re following along in SQL Server Management Studio, we can run the three-column UPDATE statement like so:

UPDATE [dbo].[Stores]
SET [StoreName] = N'West Edmonton Mall',
[ManagerName] = N'Wesley Wensleydale',
[ManagerEmail] = N'wesley@example.com'
WHERE [StoreID] = 2;
GO

Once we have executed the statement (using F5 on the keyboard, or clicking the Execute button in the toolbar), we see a message in the Messages pane:

(1 row(s) affected)

Using our recently-learned skill to SELECT the data from this table, we will see the new result set, containing the updated row.
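
A minimal sketch of that check, assuming the same columns we used in the UPDATE:

SELECT [StoreID], [StoreName], [ManagerName], [ManagerEmail]
FROM [dbo].[Stores]
WHERE [StoreID] = 2;
GO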

Congratulations! We have modified a row of data in our table, and a SELECT query proved that it was updated.

Next time, we will be removing data from a table using the DELETE command. Stay tuned.

Look me up on Twitter if you want to ask any questions about this series, at @bornsql.

Normalization, The Sequel

If there’s one thing that SQL Server is really good at, it’s relationships. After all, a relational database management system without the relationships is nothing more than a place to store your stuff.

Last week we briefly looked at a denormalized table, and then I suggested that breaking it up into five separate tables would be a good idea. So good, in fact, that it took me more than 2,000 words to explain just the first table in our highly contrived example.

Assuming you have read through all those words, let’s attempt a much more condensed look at the other four tables. If you recall, we had:

  • Transactions
  • Products
  • Customers
  • Salespersons
  • Stores

We tackled the Stores table first because everything is backwards when we design databases.

For the next three tables, I’m going to just show you how I would design them, without going into detail. Take a close look at Salespersons, though (which we’ll look at first), because it will give you a clue about how we finally link all the tables together in the Transactions table.

Then take a look at … PaymentTypes? ProductTypes? Colours? Categories? Sizes? Uh … What’s going on here? Where did all those tables come from? Luckily, T-SQL allows comments, which you’ll see below.

CREATE TABLE [Salespersons] (
[SalespersonID] SMALLINT NOT NULL IDENTITY(1,1),
[StoreID] SMALLINT NOT NULL,
[FirstName] NVARCHAR(255) NOT NULL,
[LastName] NVARCHAR(255) NOT NULL,
[EmailAddress] NVARCHAR(512) NOT NULL,
CONSTRAINT [PK_Salespersons] PRIMARY KEY CLUSTERED ([SalespersonID] ASC)
);

-- List of possible payment types, (e.g. credit card, cheque)
CREATE TABLE [PaymentTypes] (
[PaymentTypeID] TINYINT NOT NULL IDENTITY(1,1),
[Description] VARCHAR(255) NOT NULL,
CONSTRAINT [PK_PaymentTypes] PRIMARY KEY CLUSTERED ([PaymentTypeID] ASC)
);

CREATE TABLE [Customers] (
[CustomerID] BIGINT NOT NULL IDENTITY(1,1),
[FirstName] NVARCHAR(255) NOT NULL,
[LastName] NVARCHAR(255) NOT NULL,
[EmailAddress] NVARCHAR(512) NOT NULL,
[Telephone] VARCHAR(25) NOT NULL,
[PaymentTypeID] TINYINT NOT NULL,
CONSTRAINT [PK_Customers] PRIMARY KEY CLUSTERED ([CustomerID] ASC)
);

-- List of possible product types (e.g. iPhone, iPhone cover, iPod)
CREATE TABLE [ProductTypes] (
[ProductTypeID] TINYINT NOT NULL IDENTITY(1,1),
[Description] VARCHAR(255) NOT NULL,
CONSTRAINT [PK_ProductTypes] PRIMARY KEY CLUSTERED ([ProductTypeID] ASC)
);

-- List of possible colours
CREATE TABLE [Colours] (
[ColourID] TINYINT NOT NULL IDENTITY(1,1),
[Description] VARCHAR(255) NOT NULL,
CONSTRAINT [PK_Colours] PRIMARY KEY CLUSTERED ([ColourID] ASC)
);

-- List of possible categories (5.5", 4.7", 4", 3.5")
-- This replaces "size", since we might use Size to denote storage
CREATE TABLE [Categories] (
[CategoryID] TINYINT NOT NULL IDENTITY(1,1),
[Description] VARCHAR(255) NOT NULL,
CONSTRAINT [PK_Categories] PRIMARY KEY CLUSTERED ([CategoryID] ASC)
);

-- List of possible sizes ("8GB", "16GB", "32GB", etc.)
-- Can also be used for other product types like laptops
CREATE TABLE [Sizes] (
[SizeID] TINYINT NOT NULL IDENTITY(1,1),
[Description] VARCHAR(255) NOT NULL,
CONSTRAINT [PK_Sizes] PRIMARY KEY CLUSTERED ([SizeID] ASC)
);

CREATE TABLE [Products] (
[ProductID] TINYINT NOT NULL IDENTITY(1,1),
[ProductTypeID] TINYINT NOT NULL,
[ColourID] TINYINT NOT NULL,
[CategoryID] TINYINT NOT NULL,
[SizeID] TINYINT NOT NULL,
[SellingPrice] SMALLMONEY NOT NULL,
CONSTRAINT [PK_Products] PRIMARY KEY CLUSTERED ([ProductID] ASC)
);

Those tables popped out of nowhere, didn’t they? Welcome to the world of normalization. To design a database properly, we come to the realisation that we can simplify the data input even more, reducing repeated values, and finding the most unique way of representing data. Products comprise product types. A single list of colours can be reused in various places. Payment types can be used for all sorts of transactional data.

We end up with a lot of tables when we normalize a database, and this is perfectly normal. When we want to read information out of the system the way senior management wants, we must join all these tables together.

The only way to join tables together in a safe and meaningful way is with foreign key relationships, where one table’s primary key is referenced in another table, with a matching data type.

The Salespersons table has a StoreID column. As it stands, there’s no relationship between Stores and Salespersons until we create the relationship using T-SQL.

ALTER TABLE [Salespersons]
ADD CONSTRAINT FK_Salespersons_Stores FOREIGN KEY (StoreID)
REFERENCES [Stores] (StoreID);

  • Line 1: Inform SQL Server that we are altering an existing table
  • Line 2: By adding a foreign key constraint (i.e. limiting what can go into the StoreID column)
  • Line 3: By forcing it to use the values from the Stores table’s StoreID column (i.e. the primary key).

In a relationship diagram, it looks like this (in SQL Server Management Studio’s Database Diagram tool):

The yellow key in each table is the Primary Key (StoreID and SalespersonID respectively). There is a StoreID column in both tables with the same data type (SMALLINT). The foreign key (FK) does not have to match the name of the primary key (PK), but it makes things a lot easier to have the same name for both sides of a relationship in large databases, so it’s a good habit.

Notice the direction of the relationship (FK_Salespersons_Stores) in the picture, with the yellow key on the table with the Primary Key. The name of the relationship is also sensible. To the casual eye, this says that there’s a Foreign Key constraint in the Salespersons table that points to the Primary Key in the Stores table.

Now we see why data types are so important with relational data. A relationship is not even possible between two tables if the data type is not the same in both key columns.

With this constraint enabled, whenever we insert data into the Salespersons table, we have to make sure that whatever we put into the StoreID column must already exist in the Stores table.
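
As a hedged illustration (the StoreID values here are made up), an INSERT that references a StoreID which does not exist in Stores will fail with a foreign key violation, while one that references an existing store will succeed:

-- Fails if there is no row in [Stores] with StoreID = 999
INSERT INTO [Salespersons] ([StoreID], [FirstName], [LastName], [EmailAddress])
VALUES (999, N'Thandi', N'Funaki', N'thandi@example.com');

-- Succeeds, provided a row with StoreID = 1 exists in [Stores]
INSERT INTO [Salespersons] ([StoreID], [FirstName], [LastName], [EmailAddress])
VALUES (1, N'Thandi', N'Funaki', N'thandi@example.com');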

Let’s do the rest of the relationships so far, and then we’ll look at the Transactions table.

ALTER TABLE [Customers]
ADD CONSTRAINT FK_Customers_PaymentTypes FOREIGN KEY (PaymentTypeID)
REFERENCES [PaymentTypes] (PaymentTypeID);

ALTER TABLE [Products]
ADD CONSTRAINT FK_Products_ProductTypes FOREIGN KEY (ProductTypeID)
REFERENCES [ProductTypes] (ProductTypeID);

ALTER TABLE [Products]
ADD CONSTRAINT FK_Products_Colours FOREIGN KEY (ColourID)
REFERENCES [Colours] (ColourID);

ALTER TABLE [Products]
ADD CONSTRAINT FK_Products_Categories FOREIGN KEY (CategoryID)
REFERENCES [Categories] (CategoryID);

ALTER TABLE [Products]
ADD CONSTRAINT FK_Products_Sizes FOREIGN KEY (SizeID)
REFERENCES [Sizes] (SizeID);

As we can see now, we have more than one foreign key relationship in the Products table, to ProductTypes, Colours, Categories, and Sizes, which is a clue to how the Transactions table will look.

CREATE TABLE [Transactions] (
[TransactionID] BIGINT NOT NULL IDENTITY(1,1),
[TransactionDate] DATETIME2(3) NOT NULL,
[ProductID] TINYINT NOT NULL,
[DiscountPercent] DECIMAL(4,2) NOT NULL DEFAULT(0),
[SalesPersonID] SMALLINT NOT NULL,
[CustomerID] BIGINT NOT NULL,
[HasAppleCare] BIT NOT NULL DEFAULT(0),
CONSTRAINT [PK_Transactions] PRIMARY KEY CLUSTERED ([TransactionID] ASC)
);
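
The Transactions table needs its own foreign keys, pointing back to Products, Salespersons, and Customers. A sketch of those constraints, following the same pattern as before (the constraint names are my own):

ALTER TABLE [Transactions]
ADD CONSTRAINT FK_Transactions_Products FOREIGN KEY (ProductID)
REFERENCES [Products] (ProductID);

ALTER TABLE [Transactions]
ADD CONSTRAINT FK_Transactions_Salespersons FOREIGN KEY (SalesPersonID)
REFERENCES [Salespersons] (SalespersonID);

ALTER TABLE [Transactions]
ADD CONSTRAINT FK_Transactions_Customers FOREIGN KEY (CustomerID)
REFERENCES [Customers] (CustomerID);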

Once all of these constraints are created, we’re left with the following relationships:

Ten tables, compared to our original one denormalized table, is a significant increase in number of tables. However, let’s compare the data usage for adding transactions.

If we were to populate each row of each table with enough data to provide the same number of transactions as last week’s two purchases, it would look like this.

Stores table: 172 bytes

  • StoreID: 2 bytes (IDENTITY value as SMALLINT)
  • StoreCode: 0 bytes (NULL)
  • StoreName: 30 bytes (“Chinook Centre” is 14 characters long including the space, so with Unicode that becomes 28 bytes, plus 2 for the overhead of using NVARCHAR)
  • Address: 86 bytes (“6455 Macleod Trail SW, Calgary, AB T2H 0K8” is 42 characters, so 84 with Unicode, plus 2 bytes for NVARCHAR overhead)
  • ManagerName: 22 bytes (“Bob Bobson” in Unicode, plus 2 bytes NVARCHAR overhead)
  • ManagerEmail: 32 bytes (“bob@example.com” in Unicode, plus 2 bytes NVARCHAR overhead)

Salespersons table: 70 bytes

  • SalespersonID: 2 bytes
  • StoreID: 2 bytes
  • FirstName: 14 bytes (“Thandi” in Unicode + 2 bytes)
  • LastName: 14 bytes (“Funaki” in Unicode + 2 bytes)
  • EmailAddress: 38 bytes (“thandi@example.com” in Unicode + 2 bytes)

PaymentTypes table: 14 bytes

  • PaymentTypeID: 1 byte
  • Description: 13 bytes (“Credit Card” + 2 bytes)

ProductTypes table: 23 bytes

  • ProductTypeID: 1 byte
  • Description: 8 bytes (“iPhone” + 2 bytes)
  • ProductTypeID: 1 byte
  • Description: 13 bytes (“iPhone Case” + 2 bytes)

Colours table: 21 bytes

  • ColourID: 1 byte
  • Description: 7 bytes (“Black” + 2 bytes)
  • ColourID: 1 byte
  • Description: 6 bytes (“Blue” + 2 bytes)
  • ColourID: 1 byte
  • Description: 5 bytes (“Red” + 2 bytes)

Categories table: 7 bytes

  • CategoryID: 1 byte
  • Description: 6 bytes (“5.5″” + 2 bytes)

Sizes table: 16 bytes

  • SizeID: 1 byte
  • Description: 7 bytes (“128GB” + 2 bytes)
  • SizeID: 1 byte
  • Description: 7 bytes (“256GB” + 2 bytes)

Products table: 27 bytes (3 products at 9 bytes each)

  • ProductID: 1 byte
  • ProductTypeID: 1 byte
  • ColourID: 1 byte
  • CategoryID: 1 byte
  • SizeID: 1 byte
  • SellingPrice: 4 bytes

Customers table: 192 bytes

  • CustomerID: 8 bytes
  • FirstName: 10 bytes (“I.M.” in Unicode + 2 bytes)
  • LastName: 18 bytes (“Customer” in Unicode + 2 bytes)
  • EmailAddress: 42 bytes (“customer@example.com” in Unicode + 2 bytes)
  • Telephone: 16 bytes (“(403) 555-1212” + 2 bytes)
  • PaymentTypeID: 1 byte
  • CustomerID: 8 bytes
  • FirstName: 10 bytes (“U.R.” in Unicode + 2 bytes)
  • LastName: 18 bytes (“Customer” in Unicode + 2 bytes)
  • EmailAddress: 44 bytes (“customer2@example.com” in Unicode + 2 bytes)
  • Telephone: 16 bytes (“(403) 665-0011” + 2 bytes)
  • PaymentTypeID: 1 byte

Transactions table: 96 bytes (32 bytes per transaction)

  • TransactionID: 8 bytes
  • TransactionDate: 7 bytes
  • ProductID: 1 byte
  • DiscountPercent: 5 bytes
  • SalesPersonID: 2 bytes
  • CustomerID: 8 bytes
  • HasAppleCare: 1 bit (expands to 1 byte)

GRAND TOTAL: 638 bytes to represent all three transactions

The denormalized version, for which we can see the original example below, works out as follows. Recall we said last week that each column was NVARCHAR(4000), or possibly even NVARCHAR(MAX).

A Wide Table Appears

At our most generous, we would need 1,166 bytes to record these three transactions. That’s almost double the data required, just for these three. Plus, the data has no foreign key relationships, so we cannot be sure that whatever is being added to the denormalized table is valid.

As time goes on, the normalized tables will grow at a much lower rate proportionally. Consider what a denormalized Transactions table would look like with an average row size of 388 bytes, for ten million rows (3.6GB).

Compare that to a normalized database, with ten million transactions for 8 million customers. Even assuming we have a hundred products, with twenty colours, and 30 product types, we would see only around 1GB of space required to store the same data.

We know Apple is one of the most successful technology companies in terms of sales, so extrapolating to 1 billion transactions, we’d be comparing 361GB for the denormalized table with less than half that (178GB), even if every single customer was unique and only ever bought one item.

Aside from the staggering savings in storage requirements, normalization gives us sanity checks with data validation by using foreign key constraints. Along with proper data type choices, we have an easy way to design a database properly from the start.

Sure, it takes longer to start, but the benefits outweigh the costs almost immediately. Less storage, less memory to read the data, less space for backups, less time to run maintenance tasks, and so on.

Normalization matters.

Next week, we talk briefly about bits and bytes, and then we will start writing queries. Stay tuned.

Find me on Twitter to discuss your favourite normalization story at @bornsql.

A First Look At Normalization

Phew! There’s a lot to take in with data types, collation, precision, scale, length, and Unicode, and we’re just getting warmed up. This week’s post is over 2,000 words long!

Over the last three weeks, we’ve gone fairly deep into data types, and now we are going to see how they come into play with normalization.

If we go back to the first post in this series, I mentioned normalization, and then apparently I forgot about it in the next two posts. What you didn’t see is that I was talking about it all along.


Fundamentals of Data Types

Last week, we discussed storing text in a database. This week we will dive deeper into data types.

When storing data in our database, we want to make sure that it’s stored accurately and that we only use the required amount of space.

This is because when we access the data later, we want to make sure any calculations are accurate; plus reading the data takes up memory, and we want to be as efficient as we can with memory usage.

There are seven data type categories in SQL Server:

  • exact numerics
  • approximate numerics
  • date and time
  • character strings
  • Unicode character strings
  • binary strings
  • other

When we want to use these data types for our columns, we need to declare them. Some require a length, some require a precision and scale, and some can be declared without a length at all. For example:

No Length (implied in data type):
DECLARE @age AS TINYINT;

Explicit Length (length is supplied):
DECLARE @firstName AS VARCHAR(255);

Precision and Scale:
DECLARE @interestRate AS DECIMAL(9,3);

Let’s talk a bit about precision and scale, because those values between the brackets may not work the way we think they do.

Precision and Scale

Data types with decimal places are defined by what we call fixed precision and scale. Let’s look at an example:

123,456.789

In the above number, we see a six-digit number (ignoring the thousand separator) followed by a decimal point, and then a fraction represented by three decimal places. This number has a scale of 3 (the digits after the decimal point) and a precision of 9 (the digits for the entire value, on both sides of the decimal point). We would declare this value as DECIMAL(9,3).

This is confusing at first glance, because we have to declare it “backwards”, with the precision first, and then the scale. It may be easier to think of the precision in the same way we think of a character string’s length.

Date and time data types can also have decimal places, and SQL Server supports times accurate to the nearest 100 nanoseconds. The most accurate datetime is DATETIME2(7), where 7 decimal places are reserved for the time.

Before SQL Server 2008, we used DATETIME, which is only accurate to the nearest 3 milliseconds, and uses 8 bytes. A drop-in replacement for this is DATETIME2(3), using 3 decimal places, and accurate to the nearest millisecond. It only needs 7 bytes per column.

Be mindful that, as higher precision and scale are required, a column’s storage requirement increases. Accuracy is a trade-off with disk space and memory, so we may be tempted to use approximate (floating point) values everywhere.

However, in cases where accuracy is required, always stick to exact numerics. Financial calculations, for example, should always use DECIMAL and MONEY data types.
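
To see why this matters, here is a small sketch comparing an approximate FLOAT calculation with an exact DECIMAL one:

-- Approximate: 0.1 and 0.2 cannot be represented exactly as FLOAT values
SELECT CASE
WHEN CAST(0.1 AS FLOAT) + CAST(0.2 AS FLOAT) = CAST(0.3 AS FLOAT)
THEN 'Equal' ELSE 'Not equal'
END AS [FloatResult]; -- Not equal

-- Exact: DECIMAL stores and adds the same values precisely
SELECT CASE
WHEN CAST(0.1 AS DECIMAL(9,3)) + CAST(0.2 AS DECIMAL(9,3)) = CAST(0.3 AS DECIMAL(9,3))
THEN 'Equal' ELSE 'Not equal'
END AS [DecimalResult]; -- Equal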

Exact Numerics

Exact Numerics are exact, because any value that is stored is the exact same value that is retrieved later. These are the most common types found in a database, and INT is the most prevalent.

Exact numerics are split up into integers (BIGINT, INT, SMALLINT, TINYINT, BIT) and decimals (NUMERIC, DECIMAL, MONEY, SMALLMONEY). Decimals have decimal places (defined by precision and scale), while integers do not.

Integers have fixed sizes (see table below), so we don’t need to specify a length when declaring this data type.

Type Bytes Range
BIGINT 8 bytes -2^63 to 2^63-1
INT 4 bytes -2^31 to 2^31-1
SMALLINT 2 bytes -2^15 to 2^15-1
TINYINT 1 byte 0 to 255
BIT 1 bit 0 to 1
  • BIT is often used for storing Boolean values, where 1 = True and 0 = False.
  • Yes, BIGINT can store numbers as large as 2 to the power of 63 minus 1. That’s 19 digits wide, with a value of 9,223,372,036,854,775,807, or 9.2 quintillion.

Decimals may vary depending on the precision and scale, so we have to specify those in the declaration.

Type Bytes Range
DECIMAL, NUMERIC 5 to 17 bytes Depends on precision and scale; 38 digits is the longest possible precision.
  • DECIMAL and NUMERIC are synonyms and can be used interchangeably. Read more about this data type, and how precision and scale affect the bytes used, in the official documentation.

Although the MONEY and SMALLMONEY data types do have decimal places, they don’t require a precision and scale in the declaration, because they behave like fixed DECIMAL(19,4) and DECIMAL(10,4) types respectively. Think of these data types as a convenience more than anything.

Type Bytes Range
MONEY 8 bytes -922,337,203,685,477.5808 to 922,337,203,685,477.5807
SMALLMONEY 4 bytes -214,748.3648 to 214,748.3647

Approximate Numerics

Approximate Numerics mean that the value stored is only approximate. Floating point numbers would be classified as approximate numerics, and these comprise FLOAT and REAL.

Declaring a FLOAT takes an optional length, which represents the number of bits used to store the mantissa. REAL is a synonym for FLOAT(24).

The mantissa means the significant digits of a number in scientific notation, which is how floating point numbers are represented. The default is FLOAT(53). Generally, we stick to the default, and use REAL if we want to save space, forgoing some of the accuracy of the larger FLOAT(53).

Type Bytes Range
FLOAT 4 or 8 bytes -1.79E+308 to -2.23E-308, 0 (zero), and 2.23E-308 to 1.79E+308
REAL 4 bytes -3.40E+38 to -1.18E-38, 0 (zero), and 1.18E-38 to 3.40E+38

Date and Time

Date and time data types are slightly more complex. For storing dates (with no time), we use DATE. We store times (with no dates) using TIME. For storing both date and time in the same column, we can use DATETIME2, DATETIME, or SMALLDATETIME. Finally, we can even store timezone-aware values comprising a date and time and timezone offset, using DATETIMEOFFSET.

DATETIME2, TIME, and DATETIMEOFFSET take an optional precision in their declarations; if it is omitted, they default to 7 (accurate to the nearest 100 nanoseconds).
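
A few hedged examples of declaring these types (the values are arbitrary):

DECLARE @DateOnly DATE = '2017-03-15';
DECLARE @TimeOnly TIME(3) = '23:59:59.997';
DECLARE @DateAndTime DATETIME2(3) = '2017-03-15 23:59:59.997';
DECLARE @WithOffset DATETIMEOFFSET(3) = '2017-03-15 23:59:59.997 -07:00';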

Character Strings

As we saw last week, characters can be fixed-length (CHAR) or variable-length (VARCHAR), and can support special Unicode character types (NCHAR and NVARCHAR respectively). Collation should also be taken into account.

Length can be 1 to 8000 for CHAR and VARCHAR, or 1 to 4000 for NCHAR and NVARCHAR. For storing values larger than that, see the Large Objects section below.

Binary Strings

Sometimes we want to store binary content in a database. This might be a JPEG image, a Word document, an SSL certificate file, or anything that could traditionally be saved on the file system. SQL Server provides the BINARY and VARBINARY data types for this (and IMAGE for backward compatibility).

Length can be 1 to 8000 for BINARY and VARBINARY. For storing values larger than that, see the Large Object section below.

Large Objects

SQL Server 2005 introduced a new MAX length for the variable-length data types VARCHAR, NVARCHAR, and VARBINARY.

(The XML data type uses MAX under the covers as well.)

This specification allows up to 2 GB of data to be stored in a column with that declared length. That doesn’t mean we should aim for 2 GB when inserting data into these columns, but it provides greater flexibility when we need to store more than 8,000 bytes in a single value.

Other Data Types

SQL Server supports other types of data, which fall outside the scope of text and numerics. These include CURSOR, TABLE, XML, UNIQUEIDENTIFIER, TIMESTAMP (not to be confused with the date and time types), HIERARCHYID, SQL_VARIANT, and Spatial Types (GEOGRAPHY and GEOMETRY).

Next week, we will see how normalization and data types work together, now that we have a good overview of the different data types in a database.

If you have any thoughts or comments, please find me on Twitter at @bornsql.

Data Types and Collation

Last week we started with a very simple definition of a database: a discrete set of information, with a specific structure and order to it.

We briefly looked at normalization, which is a way to store as little of the information as possible, so that it stays unique.

We will cover more normalization as we move forward through this series, but first we will talk about how the information, or data, is stored. (This does affect normalization and relationships, even if that is not immediately clear.)

For this week’s discussion, we need to consider a spreadsheet application, like Microsoft Excel or Google Sheets.

Columns and Rows

In a spreadsheet, we have columns and rows. Usually we will also have a header row, so we can distinguish between each column.

Each column, in turn, may be formatted a certain way so that we can easily see what kind of information is in that column. For instance, we may want to format a column as a currency, with leading symbol and the two decimal places at the end.

We may left-align text values, and we may decide that numbers have no decimal places and are right-aligned. Dates and times get their own formatting.

If we were to compare this structure to that of a database, we can imagine that each sheet is a table, and each column and row is a column and row in the table.

In some applications like Microsoft Access, we may hear different terminology for columns and rows, namely fields and records. However, in SQL Server, we maintain the same convention as Excel and call them columns and rows.

Because SQL Server doesn’t care about how our data looks, we have to specify those traits when we create the table. Whether creating from scratch or from an import process through an external application (Excel, for example), we need to specify the data type for each column.

There are several key reasons why we want to do this.

In the case of numbers that will be summarized in some way (sum, average, minimum, maximum, mean, mode), we want SQL Server’s database engine to treat these as numbers internally so that it doesn’t have to convert anything, which in turn makes the calculations much faster.

The same goes for dates, times, and datetimes (where both the date and time is in one column) because the database engine understands date and time calculations, provided the data types are correct.

Text values are also very important but for a fundamentally different reason. While computers understand numbers, it’s humans that understand text.

We will focus the rest of this week’s discussion on storing strings in a database.

Collation

Imagine we are developing a database for international customers, and we need to support accented characters or an entirely different alphabet. Database systems use a catch-all term for this, and that is collation.

When we install SQL Server, we are asked to choose a default collation, and presented with some arcane terminology which may be confusing, so we leave the defaults and click Next.

Collation has to do with how data is sorted, and thus the order in which we see it when data is returned.

Note that collation only affects text columns.

The Windows regional settings, for the user installing SQL Server, will affect the default collation of a SQL Server installation. If we were to install SQL Server on a machine that is configured with U.S. regional settings, it will have a very different default collation than a server that is set for Canada or Finland.

The default SQL Server collation for US regional settings (SQL_Latin1_General_CP1) may need to be changed to match what is required for the user databases that will be running on a server.

The above values mean the following:

  • General – the sort order follows 0-9, A-Z;
  • CP1 – code-page 1, the US English default;
  • Case Insensitivity and Accent Sensitivity are implied (see below).

When not using US English, or the Latin alphabet, we need to be aware that the data’s sort order is taken into account.

Even more confusingly, some vendor products require a specific collation for their database. For example, Microsoft’s own SharePoint database uses the collation Latin1_General_CI_AS_KS_WS:

  • CI – Case Insensitive – no difference between upper and lower case when sorting data;
  • AS – Accent Sensitive – distinguishes between accented characters, for instance, the Afrikaans words “sê” and “se” are considered different;
  • KS – Kana Sensitive – distinguishes between different Japanese character sets;
  • WS – Width Sensitive – distinguishes between characters that can be expressed by both single- or double-byte characters.

(Read up more about collation options here.)
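
If you want to see collation in action, a couple of quick checks might look like this (the accent-insensitive comparison uses Latin1_General_CI_AI purely as an illustration):

-- The collation the instance was installed with
SELECT SERVERPROPERTY('Collation') AS [ServerCollation];

-- The collation of the current database
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS [DatabaseCollation];

-- Accent sensitivity changes whether these two strings are considered equal
SELECT CASE WHEN N'se' = N'sê' COLLATE Latin1_General_CI_AS
THEN 'Equal' ELSE 'Different' END AS [AccentSensitive]; -- Different

SELECT CASE WHEN N'se' = N'sê' COLLATE Latin1_General_CI_AI
THEN 'Equal' ELSE 'Different' END AS [AccentInsensitive]; -- Equal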

Text Data Types

Now that we have a very basic grasp of collation, let’s look at text data types.

We tend to use only four text data types in SQL Server these days:

CHAR(n), NCHAR(n), VARCHAR(n), and NVARCHAR(n), where n may be a number between 1 and 8,000 or the keyword MAX.

Why 8,000?

For historic reasons, SQL Server set its data page size (the amount of storage available on each data page, including headers and footers) to 8KB many years ago. This means that the largest amount of data we can store on a single page is 8,192 bytes. Once we take away the header and the slot array at the end, we are left with slightly more than 8,000 bytes for our data.

When we store a text value, we need to decide if the characters can be expressed in a single byte, or need to be stored as double-byte characters (also known as Unicode, using two bytes per character). Writing systems like Kanji and Chinese (Simplified or Traditional) require double-byte characters, and even Latin-based alphabets like Turkish include characters that fall outside the default single-byte code page.

(Some code pages need more than two bytes for a character. That is outside of the scope of this discussion.)

So CHAR and VARCHAR use one byte per character, while NCHAR and NVARCHAR use two bytes per character (the N represents Unicode).

Thus, the longest a CHAR or VARCHAR string can be is 8000, while the longest an NCHAR or NVARCHAR string can be is 4000 (at two bytes per character).
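
A quick sketch showing the difference in storage, using the DATALENGTH function to count the bytes:

SELECT DATALENGTH('Calgary') AS [SingleByteLength]; -- 7 bytes
SELECT DATALENGTH(N'Calgary') AS [DoubleByteLength]; -- 14 bytes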

MAX Data Type

In SQL Server 2005, several enhancements were introduced, including the MAX length for string and binary data. The underlying storage mechanism was changed to allow columns longer than 8,000 bytes, which are stored in a separate section of the database file under certain conditions.

The MAX specifier allows up to 2 GB (more than two billion bytes) to be stored in that column, for every row in which it is used.

So we have to consider three distinct things when deciding how we store text: collation, Unicode, and string length.

Because my readers are terribly intelligent, you’ve already deduced that the VAR in VARCHAR means “variable length”, and you’d be correct.

We use VARCHAR (and its Unicode equivalent NVARCHAR) for columns that will contain strings with variable lengths, including names, addresses, phone numbers, product names, etc. In fact, along with INT (meaning a 4-byte integer), VARCHAR is probably the most common data type in any database today.

CHAR (and NCHAR), on the other hand, are fixed-length data types. We use this type for string lengths that are unlikely to change. For example, IMEI numbers, Vehicle Identification Numbers, social security numbers (where the dash forms part of the number), product codes, serial numbers with leading zeroes, and so on. The point here is that the length is fixed.

So why don’t we just use VARCHAR instead of CHAR everywhere?

Let’s start with why VARCHAR was introduced in the first place, and why we would use it instead of CHAR.

For columns with unpredictably long strings, we don’t want to reserve all 8,000 bytes per row for a string that may only take up 2,000 bytes—and end up wasting 6,000 (not to mention the storage required for a MAX column)—so we switch to VARCHAR, and each row only uses as many bytes as it needs.

However, SQL Server needs to keep track of the length of a VARCHAR column in each row of a table. There is a small overhead of two bytes per row for every variable-length column so that SQL Server can keep track of this length. The reason we don’t replace CHAR and NCHAR outright is, ironically, to save space.

It doesn’t make sense for a table containing millions or billions of rows to use VARCHAR for fixed-length columns, because we would be adding a few bytes per row of unnecessary overhead. Even one extra byte per row, over a million rows, is roughly 1 MB of additional storage.

Extrapolating that extra byte to the memory required to hold it, maintenance plans when updating indexes and statistics, backups, replicated databases, and so on, we are now looking at extra megabytes, and possibly gigabytes, for the sake of convenience.

We must make sure that we pick the correct character type for storing strings, beyond just the length of the string. Both CHAR and VARCHAR have their place.
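
As a hedged sketch (the table and column names are made up), a table might mix the two like this:

CREATE TABLE [dbo].[Vehicles] (
-- A Vehicle Identification Number is always 17 characters, so CHAR fits
[VIN] CHAR(17) NOT NULL,
-- Owner names vary in length, so a variable-length Unicode type fits better
[OwnerName] NVARCHAR(255) NOT NULL,
CONSTRAINT [PK_Vehicles] PRIMARY KEY CLUSTERED ([VIN] ASC)
);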

While we did spend most of this discussion on collations and text, we’ve only scratched the surface.

Next week, we will discuss how to pick the right data type for your columns, with concrete examples. This matters a lot with how numbers are stored.

If you have any feedback, find me on Twitter at @bornsql.

My surname is NULL

Last Wednesday on Twitter, Abayomi Obawomiye (@SQLAmerica) wrote:

I just met someone with the last name NULL today. I really want to ask if they had issues with the last name but worried might be offensive

Abayomi goes on further to say that the word “NULL” gets replaced with a space in any CSV files generated for reports, which got me thinking about this problem.

Then I realised I had already written about it. We make assumptions on a daily basis when it comes to working with data. In this case, Abayomi assumed that no one’s name could ever be “Null”.

Yes, I’ve made that assumption too.

This sparked a memory in the deep and rusty recesses of my brain. I remember reading an article about someone whose last name is Null. It took me just a few seconds to find the article on Wired Magazine’s website, and the pullout phrase is right in the headline:

Hello, I’m Mr. Null. My Name Makes Me Invisible to Computers

It turns out that Abayomi is not alone with this chance meeting. How many of us consider the possibility that legitimate data in our databases could match reserved keywords?

The author of that article, Christopher Null, has an amusing workaround:

My usual trick is to simply add a period to my name: “Null.” This not only gets around many “null” error blocks, it also adds a sense of finality to my birthright.

In my goal to build systems that are easier to use, that should adapt to the consumer, this is clearly problematic. We shouldn’t be forcing Christopher to change his own name to fit our database model.

How would I go about dealing with this problem? Clearly, when exporting data to CSV or generating extracts for third-party use, this problem has to be solved.

Delimiters are Good

The easiest way I can think of is for data types to be delimited correctly. When exporting to CSV format, we would ensure that all strings be surrounded by quotation marks. This way, Christopher Null’s record would show up as follows (assuming last name, first name ordering):

"Null","Christopher"

Then the parsing process on the reporting side would import that into whichever system without stumbling over the value.

Another issue raised by Christopher in his Wired article is that his email address is rejected outright by Bank of America because it contains the word null before the @ sign.

First and Last Name columns are bad

I’m going to call myself out on the above example and say that assuming names can fit neatly into two columns is bad. Firstly, Cher and Madonna would complain.

Secondly, Vice Admiral Horatio Nelson, 1st Viscount Nelson, 1st Duke of Bronté, would run out of space before his name could be entered correctly.

This doesn’t even begin to address names that are outside of our mainly Western perspective. In South Africa, I saw names that few developers in North America would consider, making use of all manner of punctuation (including spaces).

My recommendation here is to have a 250-character NVARCHAR column called FullName and leave it at that. If you really want your mail merge software to send “personalised” emails or letters, add an additional PreferredName column, and make sure it’s properly delimited when exporting.
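
A minimal sketch of that recommendation (the table name and lengths are illustrative):

CREATE TABLE [dbo].[People] (
[PersonID] BIGINT NOT NULL IDENTITY(1,1),
[FullName] NVARCHAR(250) NOT NULL,
[PreferredName] NVARCHAR(250) NULL,
CONSTRAINT [PK_People] PRIMARY KEY CLUSTERED ([PersonID] ASC)
);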

Christopher Null wouldn’t have to bat an eyelid here. It would solve his problem.

(While I’m on this matter, don’t ask for Gender or Sex. You don’t need it. If you do, you’ll know how to ask properly.)

Email Validation is Stupid

My third rule for ensuring that Nulls can exist in this world with Wests and Obawomiyes is to stop validating email addresses!

Firstly, most developers use a regular expression to do it, which is wrong and terrible and slow, and (clearly) imperfect.

This is a particular bugbear of mine, since I have one character before the @ in my personal email address. Many systems fail when trying to validate it, including WestJet. My own workaround is to create a westjet@ alias so that I can receive email from WestJet.

The best way to validate an email address is to send an email to that address with a link for someone to click on. If they don’t click on the link to close the loop, the email address isn’t valid, and your well-designed system won’t allow them to continue to the next step unless that link is clicked.

Summary

Three simple recommendations could help the Nulls find visibility again and ensure data inserted into a system is not modified unnecessarily.

For more reading on this and similar problems, I recommend visiting Bobby Tables.

If you have any experience in this area, please feel free to contact me on Twitter, at @bornsql.

A dalliance with distance

During a recent engagement, I was tasked with making a query faster. Since I signed an NDA (Non-Disclosure Agreement), I can’t go into detail, but it made extensive use of the STDistance() function of the GEOGRAPHY data type.

The query is being called many times a second and, owing to the way the application was designed, is effectively running in a cursor. For each row in a table that contains GPS coordinates, it has to calculate how far the principal GPS coordinates are from that row’s latitude and longitude.

As you can imagine, even with only a few hundred rows, this does not scale well. The database had hundreds of thousands of rows that needed to be calculated each time the query ran because these values are constantly changing.

The data was then being ordered on the STDistance result so that the nearest neighbour showed up first.

After some investigation, I discovered that the spatial index on the coordinates column was not set up correctly. However, we were unable to schedule a maintenance window to rebuild that index. Therefore, I was tasked to make the query faster without modifying any indexes.

To begin with, removing the ORDER BY clause made the query run much more quickly, even though it was running in a loop. After removing the looping structure and refactoring the query to be set based, it ran even more quickly. No surprises there, but the result set was now much larger.

However, it was still causing blocking on the table where coordinates were being stored, and the returning data was unsorted.

Enter the great circle. In principle, any distance between two points on the surface of a sphere can be calculated using a relatively simple mathematical formula (known as the haversine formula). If there’s one thing I know about computers, it’s that they perform mathematical formulas really well.

The haversine formula is not simple to express in a T-SQL query. Instead, we can use the spherical law of cosines, which is not as accurate (there is a maximum drift of around 2 metres, or a 0.3% margin of error).

(cos c = cos a cos b + sin a sin b cos C)

The advantage of this cosine law is that it’s one line of code, which means it can be used as a drop-in replacement for STDistance.

The customer agreed that, for the purpose of this query, 0.3% was well within an acceptable error range. I helped the lead developer rewrite the query to replace the STDistance() value with the cosine formula, which improved performance by a factor of 1,000 and removed the need to use the spatial index at all.

In purely T-SQL terms, this looks as follows (the [Latitude] and [Longitude] are columns in the table I’m querying):

-- Latitude of source
DECLARE @Latitude FLOAT(53) = 34.09833;
-- Longitude of source
DECLARE @Longitude FLOAT(53) = -118.32583;
-- Mean radius of the earth, in miles (the formula needs the radius, not the diameter)
DECLARE @Radius FLOAT(53) = 3959;

SELECT [ColPK], [Latitude], [Longitude],
-- Spherical law of cosines, scaled by the earth's radius to give miles
ACOS(COS(RADIANS(90 - @Latitude)) * COS(RADIANS(90 - [Latitude]))
+ SIN(RADIANS(90 - @Latitude)) * SIN(RADIANS(90 - [Latitude]))
* COS(RADIANS(@Longitude - [Longitude]))) * @Radius AS [Distance]
FROM [InterestingTable];

(Source: http://gis.stackexchange.com/a/40929)

Now of course this can be ordered and sorted as required. Because the distance is expressed in miles, we can filter the results to only display values that fall within a specific radius from the source, making the query run even faster.
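
A sketch of that filter, reusing the variables and formula from the query above (via CROSS APPLY, so the calculation is only written once), limited to a 10-mile radius:

SELECT [t].[ColPK], [t].[Latitude], [t].[Longitude], [d].[Distance]
FROM [InterestingTable] AS [t]
CROSS APPLY (
SELECT ACOS(COS(RADIANS(90 - @Latitude)) * COS(RADIANS(90 - [t].[Latitude]))
+ SIN(RADIANS(90 - @Latitude)) * SIN(RADIANS(90 - [t].[Latitude]))
* COS(RADIANS(@Longitude - [t].[Longitude]))) * @Radius AS [Distance]
) AS [d]
WHERE [d].[Distance] <= 10
ORDER BY [d].[Distance];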

If you have any neat geography data things to share, find me on Twitter at @bornsql.

Data Efficiency in SQL Server: DECIMAL

This is part two of a short series of posts about how I assign efficient data types when designing a new table or database.

Use less space with BIGINT

Last week, I spoke about a more efficient DATETIME than DATETIME. This week I have a more efficient DECIMAL than DECIMAL.

If you’re planning to store a long number that will never have a fraction, and you were considering using DECIMAL, I’m going to ask you to think about BIGINT first.

Granted, this is a very specific use-case, and requires some forethought, but that’s our job: to think ahead.

Here’s the scenario where I opted in for a BIGINT over a DECIMAL: IMEI numbers.

According to Wikipedia, the International Mobile Station Equipment Identity (that long number on your mobile device that identifies it uniquely—well, mostly) is usually from 14 to 16 digits in length, and may be as short as 8 digits in length.

A DECIMAL data definition requires two values: precision and scale.

Let’s say you want to store the value 123,456.789. The number of digits in total (both to the left and right of the decimal point) is nine. Thus, the precision is 9.

The scale is everything to the right of the decimal point, which in this case is 3.

In other words, we would define this value as DECIMAL(9,3).

A precision of 1–9 digits requires only 5 bytes of storage, which makes a small DECIMAL far more efficient than a BIGINT, which always requires the full 8 bytes of storage and cannot store a fraction at all.

However, as the precision increases, so do the storage requirements. A precision of 10–19 digits requires 9 bytes (an 80% jump in storage), and this is the sweet spot where a BIGINT becomes an effective alternative.

If you do not need to store a fraction, as in the case of an IMEI number, consider BIGINT. 8 bytes will safely store a value up to 19 digits in length (for a maximum value of 2⁶³-1).

So if you’re storing IMEI numbers for a country’s worth of mobile devices, say 250 million rows, that one byte per row saves you roughly 238MB in memory, queries, storage, indexing, backups, and so on.
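
As a concrete sketch (the table and column names are hypothetical), the column might be declared like this:

CREATE TABLE [dbo].[Devices] (
[DeviceID] INT NOT NULL IDENTITY(1,1),
-- BIGINT: 8 bytes, and comfortably holds a 14- to 16-digit IMEI
[IMEI] BIGINT NOT NULL,
-- The DECIMAL alternative, DECIMAL(16,0), would need 9 bytes per row
CONSTRAINT [PK_Devices] PRIMARY KEY CLUSTERED ([DeviceID] ASC)
);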

I think that’s worth thinking about.

Hit me up on Twitter, at @bornsql, to tell me how you’ve made efficient data design choices.

Data Efficiency in SQL Server: DATETIME

This is the first in a short series of posts about how I assign efficient data types when designing a new table or database.

Use less space with DATETIME2

We all know that the DATETIME data type uses 8 bytes of space to store the date and time, to an accuracy of a paltry 3 milliseconds.

This used to cause no end of drama in the days before SQL Server 2008 because milliseconds were rounded to end in 0, 3 or 7. When trying to calculate the last possible time in a day, you had to work with some crazy values like 2016-03-15 23:59:59.997.

Fortunately, SQL Server 2008 introduced DATETIME2, which put paid to that drama. We now have a precision of up to 7 places after the decimal, and it still only uses a maximum of 8 bytes!

For example, we can now store a value of 2015-03-15 23:59:59.9999999 which is mind-bogglingly close to midnight, and not worry about insane values ending in a 7.

Do you really need that precision, though? How about a way to actually use less storage and store more accurate data?

If 0–2 places of precision (after the decimal) requires only 6 bytes, 3–4 places requires 7 bytes, and 5–7 places requires the full 8 bytes, we can save ourselves a whole byte and not lose any precision by using DATETIME2(3) as a drop-in replacement for DATETIME.

We get precision down to the millisecond (even more accurate than before) and shave off one byte for every row in that column. In a table with ten million rows, that’s 9.5MB of space we no longer have to store, or query, or index, or back up.
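
A minimal sketch of what that drop-in replacement looks like in a table definition (the table itself is hypothetical):

CREATE TABLE [dbo].[AuditLog] (
[AuditLogID] BIGINT NOT NULL IDENTITY(1,1),
-- DATETIME2(3): 7 bytes per row, accurate to the millisecond
[EventTime] DATETIME2(3) NOT NULL DEFAULT (SYSDATETIME()),
CONSTRAINT [PK_AuditLog] PRIMARY KEY CLUSTERED ([AuditLogID] ASC)
);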

What data efficiency method have you used in production? Let me know on Twitter at @bornsql.