
My First SELECT Statement

Microsoft SQL Server makes it really easy for us to query tables. In SQL Server Management Studio (SSMS) for instance, we can right-click on any table we have access to and select the top 1000 rows from that table.
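
Behind the scenes, that menu option generates something along these lines (a sketch; the real script lists every column in the table):

SELECT TOP (1000) [StoreID], [StoreCode], [StoreName]
FROM [dbo].[Stores];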

Don’t do this

Please don’t query tables this way in a production environment. It’s a bad way to do it, and you should feel bad.

Writing a SELECT statement should be done manually, the way Ada Lovelace and Grace Hopper intended.

Let’s assume we want to get a list of the stores in our database that we created in the First Look at Normalization post.

The table had the following columns: StoreID, StoreCode, StoreName, Address, ManagerName, and ManagerEmail.

To get a list of all rows and all columns in a table, we write the following statement:

SELECT
    [StoreID],
    [StoreCode],
    [StoreName],
    [Address],
    [ManagerName],
    [ManagerEmail]
FROM
    [dbo].[Stores];

Remember from previous posts that the square brackets are a matter of style, and we could just as easily exclude them.

I leave them in because humans are terrible at naming things, and the square brackets make the code less likely to fail.
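
For example, assuming a contrived [dbo].[Orders] table where someone put a space in a column name:

-- Without the brackets, this query would fail with a syntax error
SELECT [Order Date] FROM [dbo].[Orders];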

You’ll notice that there is a semi-colon at the end of the statement. We could have placed the entire statement on one line, like so:

SELECT [StoreID], [StoreCode], [StoreName], [Address], [ManagerName], [ManagerEmail] FROM [dbo].[Stores];

This is more difficult to read. SQL Server doesn’t really care about white space, including carriage returns, so feel free to format your code nicely so that it’s easy for you to read.

If you’re typing this in Management Studio, you can now press the Execute button in the menu, or the F5 key on the keyboard, to run the command.

Tip: in Management Studio, we can select just the text (T-SQL code) we want to run, using the mouse, and then press F5. This will guarantee that no other part of the script will run as well. It’s a useful way to run portions of code in a longer script.

Once we run the SELECT statement, we see a result set.


Congratulations! We have asked SQL Server for data, and it has responded with the data we asked for.

Next time, we will be adding data to a table using the INSERT command. Stay tuned.

Look me up on Twitter at @bornsql if you want to ask any questions about this series.

Querying a Database

When we want to retrieve information from a database, we query the structure with language appropriate to the database.

Remember right at the start of this series we saw that a database could be a phone book or a recipe book. So how do we find the phone number of Randolph West? By looking up the surnames starting with W, and going through all the Wests in alphabetical order until we get to that entry.

The same goes for finding the recipe for lemon meringue in a recipe book. Start at the index at the back, look through the desserts, and then search for meringue.

In an RDBMS (relational database management system), the language for querying data is called Structured Query Language, or SQL. We can pronounce it like “sequel”, or sound out each letter.

SQL Server is commonly pronounced “Sequel Server”. MySQL is pronounced “My-S-Q-L”, sounding out the letters (some people pronounce it “My-Sequel”). It all depends on who’s saying it. PostgreSQL is just pronounced “Postgres”, because seriously.

These, along with Oracle, are the major players in the RDBMS industry.

ANSI SQL

When it comes to putting information into a database server and getting information out, we can write queries that look very similar across platforms. This is thanks to a standards body called ANSI (American National Standards Institute), which (with proposals from each vendor) publishes a standard SQL syntax that all vendors are expected to follow.

For the sake of this series, we will assume that the vendors do follow the standard exactly, but in reality it isn’t that simple.

Putting data in: INSERT

Adding new data to a database is performed using an INSERT operation.

Changing data: UPDATE

Modifying existing data is done with an UPDATE operation.

Getting rid of data: DELETE

Removing rows from a table is performed with a DELETE operation.

Getting data out: SELECT

The vast majority of operations in a database have to do with data retrieval. To get data out, we SELECT it.
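
As a sketch of things to come (using a hypothetical [Stores] table; we’ll cover each of these commands properly later in the series), the four operations look like this:

-- Putting data in
INSERT INTO [Stores] ([StoreName]) VALUES (N'Chinook Centre');

-- Changing data
UPDATE [Stores] SET [StoreName] = N'Market Mall' WHERE [StoreID] = 1;

-- Getting rid of data
DELETE FROM [Stores] WHERE [StoreID] = 1;

-- Getting data out
SELECT [StoreID], [StoreName] FROM [Stores];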

CRUD

In technical circles, among software developers and database developers, we might refer to these four operations using the acronym CRUD, which stands for Create, Read, Update, Delete.

When CRUD is used to describe a specific database or application, it can imply that the database is just being used as a data store (a virtual box of random stuff) and may not have proper relationships between the tables, nor be normalized.

There’s nothing necessarily evil about denormalized data, because the application code may handle that intelligence. Be wary though. Using an RDBMS to store non-relational data might cause headaches.

Next time we will write our first SELECT statement to query a database. Stay tuned!

Look me up on Twitter at @bornsql if you want to ask any questions about this series.

SQL Server Management Studio v17.0

Version numbers are confusing. SQL Server Management Studio (SSMS), the client user interface by which most DBAs access SQL Server, was decoupled from the server product for SQL Server 2016.

For the last 18 months or so, we have been receiving semi-regular updates to SSMS, which we can download from Microsoft for free. This is amazing and awesome.

SQL Server 2017 was recently announced, as I mentioned last week, and the internal server version is going to be some variation of 14.0. The compatibility level is 140. This follows logically from previous versions.

Management Studio, however, has taken a different tack with version numbers. The latest version, which was released at the end of April, is 17.0.

So if you’re confused, you’re not alone.

To make up for some of that confusion, here’s a neat trick in SSMS 17. Let’s say I want to present a session at a SQLSaturday. In the past, I had to modify the fonts and sizes for myriad settings in Management Studio. Paul Randal has a seminal post about this very thing.

With version 17, we can enable a new feature called Presenter Mode, which automatically sets the fonts and sizes of the SSMS interface to improve visibility when sharing your screen in a conference call or using a projector.

In the Quick Launch textbox on the top right of the screen (press Ctrl+Q), type the word PresentOn.

With Presenter Mode enabled, the standard SSMS interface immediately switches over to larger, more readable fonts.

To switch it back to the default view, it’s not PresentOff as we might expect. Instead, we must type RestoreDefaultFonts in the Quick Launch textbox.

Note: the results grid will not take on the new settings until SSMS is restarted. Keep this in mind when switching Presenter Mode on and off.

If you have any more SQL Server Management Studio 17 tips to share, find me on Twitter at @bornsql.

SQL Server 2017 Announced

By now you will have heard that the next version of SQL Server has been announced. There’s no release date yet, but Brent Ozar seems to think it’ll be before June.

There are many new features, but the biggest deal is that SQL Server 2017 runs on both Windows Server and Linux. Yup, SQL Server 2017 is certified to run on Windows Server, Red Hat Enterprise Linux, Ubuntu, and SuSE. (You can even run it on macOS in a Docker container.)

There are some big improvements to the Query Optimizer as well, starting with Adaptive Query Processing. It’s like a smart go-faster button, making incremental improvements to your queries the more often they run.

Although it is still in pre-release, Microsoft states that SQL Server 2017 is production-ready, which means that it should work as expected.

Watch the announcement, and download the latest technical preview to try it for yourself.

Something something Production something something you idiot

You’re reading this series of posts because you want to learn about databases and how to use them.

What you should not be doing is learning about databases and how to use them, while working inside a production environment.

Also called “prod”, because we’re lazy, this server (or servers) is not meant for testing things out. We should always make sure that when practising new bits of code, we do it in a development environment (“dev”). At least if we make a mistake (and mistakes happen a lot in this field), it won’t cause the main payroll system to crash.

The best way to set up a development server is to create a virtual machine where you can install anything you like, and if something goes wrong, you can delete and rebuild the virtual machine without causing monetary and/or data loss.

I’m not kidding. Go set up a virtual machine. If you don’t know how, then ask me and I’ll explain it (there’s a future blog post for setting up a virtual machine).

Get off production.

Upcoming SQLSaturdays in Alberta, Canada

Taking a short break from the Database Fundamentals series of the last few weeks, I’d like to mention some upcoming PASS community events in the province of Alberta.

I will be presenting at SQLSaturday #594 in Edmonton on 22 April 2017 (this coming Saturday). My topic is Migrating to Azure SQL Database: Tips, Tricks and Lessons Learned.

Next weekend, I will be hosting SQLSaturday #607 in Calgary on 29 April 2017. This is the first ever SQLSaturday in the city of Calgary, and we even have a special message from our celebrity mayor, Naheed Nenshi.

If you live in or around these two cities, please come and say hi. You can also reach out to me on Twitter at @bornsql or @sqlsatcalgary.

The Database Fundamentals series will continue next week.

So, like, what is a byte?

A friend of mine in the filmmaking business, who is exceedingly bright but has never worked with SQL Server before, was reading through the first five posts of this Database Fundamentals series, and asked a great question:

“I guess I’m not understanding what a byte is. I think I’m circling the drain in understanding it, but not floating down.”

She has a way with words.

I answered her immediately, but it reminded me that I did get a little carried away with data types, assuming that everyone reading that post would understand what a byte is.

In the innards of the computer is the CPU, or Central Processing Unit (there might be more than one in a server). The CPU is best described as a hot mess of on-off switches. Just as it is in your house, a switch only has two states.

This is what “binary” means. When the CPU clock ticks over, billions of times per second, if a switch is closed, it’s a 1 (electricity can flow to complete the circuit). If the switch is open, it’s a 0 (electricity cannot pass through it).

The CPU (along with memory, the storage system, and the network) understands binary, and the software that sits on top of it uses binary as well.

We end up with a series of 1s and 0s that, when arranged in different combinations, represent information in some form or another. Each of these is a binary digit, or bit.

Through a series of decisions in the old days of computing, when we stick eight of these bits of data together, they form a byte.

Now comes the mathematical part of today’s post.

If we have 8 digits that can store two values each, we get a total of 2 x 2 x 2 x 2 x 2 x 2 x 2 x 2 combinations. This is more easily typed as 2^8, or 256. In other words, a byte can store a maximum of 256 values.
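
We can even ask SQL Server to do this arithmetic for us, with the built-in POWER() function:

SELECT POWER(2, 8); -- returns 256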

Here’s a short list of bytes to give you an example (I have not listed every one of the 256 possibilities). We write the bits in groups of four to make them easier to read.

Binary Decimal ASCII Binary Decimal ASCII
0010 0000 32 <space> 1000 0001 129 Å
0010 0001 33 ! 1000 0010 130 Ç
0010 0010 34 " 1000 0011 131 É
0010 0011 35 # 1000 0100 132 Ñ
0010 0100 36 $ 1000 0101 133 Ö
0010 0101 37 % 1000 0110 134 Ü
0010 0110 38 & 1000 0111 135 á
0010 0111 39 ' 1000 1000 136 à
0010 1000 40 ( 1000 1001 137 â
0010 1001 41 ) 1000 1010 138 ä
0010 1010 42 * 1000 1011 139 ã
0010 1011 43 + 1000 1100 140 å
0010 1100 44 , 1000 1101 141 ç
0010 1101 45 - 1000 1110 142 é
0010 1110 46 . 1000 1111 143 è
0010 1111 47 / 1001 0000 144 ê
0011 0000 48 0 1001 0001 145 ë
0011 0001 49 1 1001 0010 146 í
0011 0010 50 2 1001 0011 147 ì
0011 0011 51 3 1001 0100 148 î
0011 0100 52 4 1001 0101 149 ï
0011 0101 53 5 1001 0110 150 ñ
0011 0110 54 6 1001 0111 151 ó
0011 0111 55 7 1001 1000 152 ò
0011 1000 56 8 1001 1001 153 ô
0011 1001 57 9 1001 1010 154 ö
0011 1010 58 : 1001 1011 155 õ
0011 1011 59 ; 1001 1100 156 ú
0011 1100 60 < 1001 1101 157 ù
0011 1101 61 = 1001 1110 158 û
0011 1110 62 > 1001 1111 159 ü
0011 1111 63 ? 1010 0000 160
0100 0000 64 @ 1010 0001 161 °
0100 0001 65 A 1010 0010 162 ¢
0100 0010 66 B 1010 0011 163 £
0100 0011 67 C 1010 0100 164 §
0100 0100 68 D 1010 0101 165
0100 0101 69 E 1010 0110 166
0100 0110 70 F 1010 0111 167 ß
0100 0111 71 G 1010 1000 168 ®
0100 1000 72 H 1010 1001 169 ©
0100 1001 73 I 1010 1010 170
0100 1010 74 J 1010 1011 171 ´
0100 1011 75 K 1010 1100 172 ¨
0100 1100 76 L 1010 1101 173
0100 1101 77 M 1010 1110 174 Æ
0100 1110 78 N 1010 1111 175 Ø
0100 1111 79 O 1011 0000 176
0101 0000 80 P 1011 0001 177 ±
0101 0001 81 Q 1011 0010 178
0101 0010 82 R 1011 0011 179
0101 0011 83 S 1011 0100 180 ¥
0101 0100 84 T 1011 0101 181 µ
0101 0101 85 U 1011 0110 182
0101 0110 86 V 1011 0111 183
0101 0111 87 W 1011 1000 184
0101 1000 88 X 1011 1001 185 π
0101 1001 89 Y 1011 1010 186
0101 1010 90 Z 1011 1011 187 ª
0101 1011 91 [ 1011 1100 188 º
0101 1100 92 \ 1011 1101 189 Ω
0101 1101 93 ] 1011 1110 190 æ
0101 1110 94 ^ 1011 1111 191 ø
0101 1111 95 _ 1100 0000 192 ¿
0110 0000 96 ` 1100 0001 193 ¡
0110 0001 97 a 1100 0010 194 ¬
0110 0010 98 b 1100 0011 195
0110 0011 99 c 1100 0100 196 ƒ
0110 0100 100 d 1100 0101 197
0110 0101 101 e 1100 0110 198
0110 0110 102 f 1100 0111 199 «
0110 0111 103 g 1100 1000 200 »
0110 1000 104 h 1100 1001 201
0110 1001 105 i 1100 1010 202
0110 1010 106 j 1100 1011 203 À
0110 1011 107 k 1100 1100 204 Ã
0110 1100 108 l 1100 1101 205 Õ
0110 1101 109 m 1100 1110 206 Œ
0110 1110 110 n 1100 1111 207 œ
0110 1111 111 o 1101 0000 208
0111 0000 112 p 1101 0001 209
0111 0001 113 q 1101 0010 210
0111 0010 114 r 1101 0011 211
0111 0011 115 s 1101 0100 212
0111 0100 116 t 1101 0101 213
0111 0101 117 u 1101 0110 214 ÷
0111 0110 118 v 1101 0111 215
0111 0111 119 w 1101 1000 216 ÿ
0111 1000 120 x 1101 1001 217 Ÿ
0111 1001 121 y 1101 1010 218
0111 1010 122 z 1101 1011 219
0111 1011 123 { 1101 1100 220
0111 1100 124 | 1101 1101 221
0111 1101 125 } 1101 1110 222
0111 1110 126 ~ 1101 1111 223
0111 1111 127 1110 0000 224
1000 0000 128 Ä 1110 0001 225 ·

There are values missing from the above table, for characters that cannot be displayed correctly in a web browser. For a complete table showing all 256 characters, visit PC Guide.com.
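
We can also try some of these values in T-SQL, using the built-in ASCII() and CHAR() functions to convert between a character and its decimal value:

SELECT ASCII('A'); -- returns 65
SELECT CHAR(66);   -- returns B
SELECT CHAR(36);   -- returns $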

How does this affect Unicode values? If you remember in our post about CHAR, NCHAR, VARCHAR and NVARCHAR data types, we discovered that the Unicode versions (those types starting with N) will use two bytes in memory and on disk to store a single character, compared to the non-Unicode (sometimes called ASCII or plain text) data types, which use only one byte per character.

The high-level reason for this is that some alphabets have more than 256 characters, so the code page (the full set of characters in upper- and lower-case where applicable, plus all the numbers, punctuation marks, and so forth) won’t fit in the 256 possibilities available in a single byte.

When we stick two bytes together however, we suddenly have as many as 2^16 values that we can store, for a total of 65,536 possibilities. That’s mostly good enough, even for large character sets like Japanese kanji.

There are exceptions to this, where some characters (including certain rare kanji) don’t fit in those 65,536 values and take up four bytes per character, stored in UTF-16 as a pair of two-byte values called a surrogate pair. The good news is, SQL Server does support these wider supplementary characters, as long as we pick the correct collation.
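
A quick way to see the one-byte versus two-byte difference for ourselves is the built-in DATALENGTH() function, which returns the number of bytes (not characters) a value uses:

SELECT DATALENGTH('SQL');  -- 3 bytes: one byte per character
SELECT DATALENGTH(N'SQL'); -- 6 bytes: two bytes per character (the N prefix means Unicode)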

I hope this answers any burning questions you may have had about bits and bytes.

Feel free to reach out to me on Twitter at @bornsql.

Normalization, The Sequel

If there’s one thing that SQL Server is really good at, it’s relationships. After all, a relational database management system without the relationships is nothing more than a place to store your stuff.

Last week we briefly looked at a denormalized table, and then I suggested that breaking it up into five separate tables would be a good idea. So good, in fact, that it took me more than 2,000 words to explain just the first table in our highly contrived example.

Assuming you have read through all those words, let’s attempt a much more condensed look at the other four tables. If you recall, we had:

  • Transactions
  • Products
  • Customers
  • Salespersons
  • Stores

We tackled the Stores table first because everything is backwards when we design databases.

For the next three tables, I’m going to just show you how I would design them, without going into detail. Take a close look at Salespersons, though (which we’ll look at first), because it will give you a clue about how we finally link all the tables together in the Transactions table.

Then take a look at … PaymentTypes? ProductTypes? Colours? Categories? Sizes? Uh … What’s going on here? Where did all those tables come from? Luckily, T-SQL allows comments, which you’ll see below.

CREATE TABLE [Salespersons] (
    [SalespersonID] SMALLINT NOT NULL IDENTITY(1,1),
    [StoreID] SMALLINT NOT NULL,
    [FirstName] NVARCHAR(255) NOT NULL,
    [LastName] NVARCHAR(255) NOT NULL,
    [EmailAddress] NVARCHAR(512) NOT NULL,
    CONSTRAINT [PK_Salespersons] PRIMARY KEY CLUSTERED ([SalespersonID] ASC)
);

-- List of possible payment types (e.g. credit card, cheque)
CREATE TABLE [PaymentTypes] (
    [PaymentTypeID] TINYINT NOT NULL IDENTITY(1,1),
    [Description] VARCHAR(255) NOT NULL,
    CONSTRAINT [PK_PaymentTypes] PRIMARY KEY CLUSTERED ([PaymentTypeID] ASC)
);

CREATE TABLE [Customers] (
    [CustomerID] BIGINT NOT NULL IDENTITY(1,1),
    [FirstName] NVARCHAR(255) NOT NULL,
    [LastName] NVARCHAR(255) NOT NULL,
    [EmailAddress] NVARCHAR(512) NOT NULL,
    [Telephone] VARCHAR(25) NOT NULL,
    [PaymentTypeID] TINYINT NOT NULL,
    CONSTRAINT [PK_Customers] PRIMARY KEY CLUSTERED ([CustomerID] ASC)
);

-- List of possible product types (e.g. iPhone, iPhone cover, iPod)
CREATE TABLE [ProductTypes] (
    [ProductTypeID] TINYINT NOT NULL IDENTITY(1,1),
    [Description] VARCHAR(255) NOT NULL,
    CONSTRAINT [PK_ProductTypes] PRIMARY KEY CLUSTERED ([ProductTypeID] ASC)
);

-- List of possible colours
CREATE TABLE [Colours] (
    [ColourID] TINYINT NOT NULL IDENTITY(1,1),
    [Description] VARCHAR(255) NOT NULL,
    CONSTRAINT [PK_Colours] PRIMARY KEY CLUSTERED ([ColourID] ASC)
);

-- List of possible categories (5.5", 4.7", 4", 3.5")
-- This replaces "size", since we might use Size to denote storage
CREATE TABLE [Categories] (
    [CategoryID] TINYINT NOT NULL IDENTITY(1,1),
    [Description] VARCHAR(255) NOT NULL,
    CONSTRAINT [PK_Categories] PRIMARY KEY CLUSTERED ([CategoryID] ASC)
);

-- List of possible sizes ("8GB", "16GB", "32GB", etc.)
-- Can also be used for other product types like laptops
CREATE TABLE [Sizes] (
    [SizeID] TINYINT NOT NULL IDENTITY(1,1),
    [Description] VARCHAR(255) NOT NULL,
    CONSTRAINT [PK_Sizes] PRIMARY KEY CLUSTERED ([SizeID] ASC)
);

CREATE TABLE [Products] (
    [ProductID] TINYINT NOT NULL IDENTITY(1,1),
    [ProductTypeID] TINYINT NOT NULL,
    [ColourID] TINYINT NOT NULL,
    [CategoryID] TINYINT NOT NULL,
    [SizeID] TINYINT NOT NULL,
    [SellingPrice] SMALLMONEY NOT NULL,
    CONSTRAINT [PK_Products] PRIMARY KEY CLUSTERED ([ProductID] ASC)
);

Those tables popped out of nowhere, didn’t they? Welcome to the world of normalization. To design a database properly, we come to the realisation that we can simplify the data input even more, reducing repeated values and storing each piece of data in exactly one place. Products comprise product types. A single list of colours can be reused in various places. Payment types can be used for all sorts of transactional data.

We end up with a lot of tables when we normalize a database, and this is perfectly normal. When we want to read information out of the system the way senior management wants, we must join all these tables together.
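
As a taste of what that looks like (a sketch; joins get their own post later in this series), here’s how we might list each salesperson alongside the name of their store:

SELECT
    [Stores].[StoreName],
    [Salespersons].[FirstName],
    [Salespersons].[LastName]
FROM [Salespersons]
INNER JOIN [Stores] ON [Stores].[StoreID] = [Salespersons].[StoreID];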

The only way to join tables together in a safe and meaningful way is with foreign key relationships, where one table’s primary key is referenced in another table, with a matching data type.

The Salespersons table has a StoreID column. As it stands, there’s no relationship between Stores and Salespersons until we create the relationship using T-SQL.

ALTER TABLE [Salespersons]
ADD CONSTRAINT FK_Salespersons_Stores FOREIGN KEY (StoreID)
REFERENCES [Stores] (StoreID);

  • Line 1: Inform SQL Server that we are altering an existing table
  • Line 2: By adding a foreign key constraint (i.e. limiting what can go into the StoreID column)
  • Line 3: By forcing it to use the values from the Stores table’s StoreID column (i.e. the primary key).

In a relationship diagram, it looks like this (in SQL Server Management Studio’s Database Diagram tool):

The yellow key in each table is the Primary Key (StoreID and SalespersonID respectively). There is a StoreID column in both tables with the same data type (SMALLINT). The foreign key (FK) column does not have to have the same name as the primary key (PK) column it references, but using the same name on both sides of a relationship makes things a lot easier in large databases, so it’s a good habit.

Notice the direction of the relationship (FK_Salespersons_Stores) in the picture, with the yellow key on the table with the Primary Key. The name of the relationship is also sensible. To the casual eye, this says that there’s a Foreign Key constraint in the Salespersons table that points to the Primary Key in the Stores table.

Now we see why data types are so important with relational data. A relationship is not even possible between two tables if the data type is not the same in both key columns.

With this constraint enabled, whenever we insert data into the Salespersons table, we have to make sure that whatever we put into the StoreID column must already exist in the Stores table.
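
For example (a sketch, with made-up email addresses), assuming a store with StoreID 1 already exists:

-- Succeeds, because StoreID 1 exists in the Stores table
INSERT INTO [Salespersons] ([StoreID], [FirstName], [LastName], [EmailAddress])
VALUES (1, N'Thandi', N'Funaki', N'thandi@example.com');

-- Fails with a foreign key violation, because there is no StoreID 999
INSERT INTO [Salespersons] ([StoreID], [FirstName], [LastName], [EmailAddress])
VALUES (999, N'Nosuch', N'Person', N'nosuch@example.com');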

Let’s do the rest of the relationships so far, and then we’ll look at the Transactions table.

ALTER TABLE [Customers]
ADD CONSTRAINT FK_Customers_PaymentTypes FOREIGN KEY (PaymentTypeID)
REFERENCES [PaymentTypes] (PaymentTypeID);

ALTER TABLE [Products]
ADD CONSTRAINT FK_Products_ProductTypes FOREIGN KEY (ProductTypeID)
REFERENCES [ProductTypes] (ProductTypeID);

ALTER TABLE [Products]
ADD CONSTRAINT FK_Products_Colours FOREIGN KEY (ColourID)
REFERENCES [Colours] (ColourID);

ALTER TABLE [Products]
ADD CONSTRAINT FK_Products_Categories FOREIGN KEY (CategoryID)
REFERENCES [Categories] (CategoryID);

ALTER TABLE [Products]
ADD CONSTRAINT FK_Products_Sizes FOREIGN KEY (SizeID)
REFERENCES [Sizes] (SizeID);

As we can see now, we have more than one foreign key relationship in the Products table, to ProductTypes, Colours, Categories, and Sizes, which is a clue to how the Transactions table will look.

CREATE TABLE [Transactions] (
    [TransactionID] BIGINT NOT NULL IDENTITY(1,1),
    [TransactionDate] DATETIME2(3) NOT NULL,
    [ProductID] TINYINT NOT NULL,
    [DiscountPercent] DECIMAL(4,2) NOT NULL DEFAULT(0),
    [SalespersonID] SMALLINT NOT NULL,
    [CustomerID] BIGINT NOT NULL,
    [HasAppleCare] BIT NOT NULL DEFAULT(0),
    CONSTRAINT [PK_Transactions] PRIMARY KEY CLUSTERED ([TransactionID] ASC)
);
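
The remaining foreign keys for the Transactions table follow the same pattern as before (sketched here for completeness):

ALTER TABLE [Transactions]
ADD CONSTRAINT FK_Transactions_Products FOREIGN KEY (ProductID)
REFERENCES [Products] (ProductID);

ALTER TABLE [Transactions]
ADD CONSTRAINT FK_Transactions_Salespersons FOREIGN KEY (SalespersonID)
REFERENCES [Salespersons] (SalespersonID);

ALTER TABLE [Transactions]
ADD CONSTRAINT FK_Transactions_Customers FOREIGN KEY (CustomerID)
REFERENCES [Customers] (CustomerID);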

With all of these relationships in place, the whole design comes together.

Ten tables, compared to our original single denormalized table, is a significant increase. However, let’s compare the data usage for adding transactions.

If we were to populate each row of each table with enough data to provide the same number of transactions as last week’s two purchases, it would look like this.

Stores table: 172 bytes

  • StoreID: 2 bytes (IDENTITY value as SMALLINT)
  • StoreCode: 0 bytes (NULL)
  • StoreName: 30 bytes (“Chinook Centre” is 14 characters long including the space, so with Unicode that becomes 28 bytes, plus 2 for the overhead of using NVARCHAR)
  • Address: 86 bytes (“6455 Macleod Trail SW, Calgary, AB T2H 0K8” is 42 characters, so 84 with Unicode, plus 2 bytes for NVARCHAR overhead)
  • ManagerName: 22 bytes (“Bob Bobson” in Unicode, plus 2 bytes NVARCHAR overhead)
  • ManagerEmail: 32 bytes (“[email protected]” in Unicode, plus 2 bytes NVARCHAR overhead)

Salespersons table: 70 bytes

  • SalespersonID: 2 bytes
  • StoreID: 2 bytes
  • FirstName: 14 bytes (“Thandi” in Unicode + 2 bytes)
  • LastName: 14 bytes (“Funaki” in Unicode + 2 bytes)
  • EmailAddress: 38 bytes (“[email protected]” in Unicode + 2 bytes)

PaymentTypes table: 14 bytes

  • PaymentTypeID: 1 byte
  • Description: 13 bytes (“Credit Card” + 2 bytes)

ProductTypes table: 23 bytes

  • ProductTypeID: 1 byte
  • Description: 8 bytes (“iPhone” + 2 bytes)
  • ProductTypeID: 1 byte
  • Description: 13 bytes (“iPhone Case” + 2 bytes)

Colours table: 21 bytes

  • ColourID: 1 byte
  • Description: 7 bytes (“Black” + 2 bytes)
  • ColourID: 1 byte
  • Description: 6 bytes (“Blue” + 2 bytes)
  • ColourID: 1 byte
  • Description: 5 bytes (“Red” + 2 bytes)

Categories table: 7 bytes

  • CategoryID: 1 byte
  • Description: 6 bytes (“5.5″” + 2 bytes)

Sizes table: 16 bytes

  • SizeID: 1 byte
  • Description: 7 bytes (“128GB” + 2 bytes)
  • SizeID: 1 byte
  • Description: 7 bytes (“256GB” + 2 bytes)

Products table: 27 bytes (3 products at 9 bytes each)

  • ProductID: 1 byte
  • ProductTypeID: 1 byte
  • ColourID: 1 byte
  • CategoryID: 1 byte
  • SizeID: 1 byte
  • SellingPrice: 4 bytes

Customers table: 192 bytes

  • CustomerID: 8 bytes
  • FirstName: 10 bytes (“I.M.” in Unicode + 2 bytes)
  • LastName: 18 bytes (“Customer” in Unicode + 2 bytes)
  • EmailAddress: 42 bytes (“[email protected]” in Unicode + 2 bytes)
  • Telephone: 16 bytes (“(403) 555-1212” + 2 bytes)
  • PaymentTypeID: 1 byte
  • CustomerID: 8 bytes
  • FirstName: 10 bytes (“U.R.” in Unicode + 2 bytes)
  • LastName: 18 bytes (“Customer” in Unicode + 2 bytes)
  • EmailAddress: 44 bytes (“[email protected]” in Unicode + 2 bytes)
  • Telephone: 16 bytes (“(403) 665-0011” + 2 bytes)
  • PaymentTypeID: 1 byte

Transactions table: 96 bytes (32 bytes per transaction)

  • TransactionID: 8 bytes
  • TransactionDate: 7 bytes
  • ProductID: 1 byte
  • DiscountPercent: 5 bytes
  • SalespersonID: 2 bytes
  • CustomerID: 8 bytes
  • HasAppleCare: 1 bit (expands to 1 byte)

GRAND TOTAL: 638 bytes to represent all three transactions

The denormalized version, shown in the original example below, works out as follows. Recall we said last week that each column was NVARCHAR(4000), or possibly even NVARCHAR(MAX).

A Wide Table Appears

At our most generous, we would need 1,166 bytes to record these three transactions. That’s almost double the data required, just for these three. Plus, the data has no foreign key relationships, so we cannot be sure that whatever is being added to the denormalized table is valid.

As time goes on, the normalized tables will grow at a much lower rate proportionally. Consider what a denormalized Transactions table would look like with an average row size of 388 bytes, for ten million rows (3.6GB).

Compare that to a normalized database, with ten million transactions for 8 million customers. Even assuming we have a hundred products, with twenty colours, and 30 product types, we would see only around 1GB of space required to store the same data.

We know Apple is one of the most successful technology companies in terms of sales, so extrapolating to 1 billion transactions, we’d be comparing 361GB for the denormalized table with less than half that (178GB) for the normalized design, even if every single customer was unique and only ever bought one item.

Aside from the staggering savings in storage requirements, normalization gives us sanity checks with data validation by using foreign key constraints. Along with proper data type choices, we have an easy way to design a database properly from the start.

Sure, it takes longer to start, but the benefits outweigh the costs almost immediately. Less storage, less memory to read the data, less space for backups, less time to run maintenance tasks, and so on.

Normalization matters.

Next week, we talk briefly about bits and bytes, and then we will start writing queries. Stay tuned.

Find me on Twitter to discuss your favourite normalization story at @bornsql.

A First Look At Normalization

Phew! There’s a lot to take in with data types, collation, precision, scale, length, and Unicode, and we’re just getting warmed up. This week’s post is over 2,000 words long!

Over the last three weeks, we’ve gone fairly deep into data types, and now we are going to see how they come into play with normalization.

If we go back to the first post in this series, I mentioned normalization, and then apparently I forgot about it in the next two posts. What you didn’t see is that I was talking about it all along.


Fundamentals of Data Types

Last week, we discussed storing text in a database. This week we will dive deeper into data types.

When storing data in our database, we want to make sure that it’s stored accurately and that we only use the required amount of space.

This is because when we access the data later, we want to make sure any calculations are accurate; plus reading the data takes up memory, and we want to be as efficient as we can with memory usage.

There are seven data type categories in SQL Server:

  • exact numerics
  • approximate numerics
  • date and time
  • character strings
  • Unicode character strings
  • binary strings
  • other

When we want to use these data types for our columns, we need to declare them. Some require a length, some require a precision and scale, and some can be declared without a length at all. For example:

No Length (implied in data type):
DECLARE @age AS TINYINT;

Explicit Length (length is supplied):
DECLARE @firstName AS VARCHAR(255);

Precision and Scale:
DECLARE @interestRate AS DECIMAL(9,3);

Let’s talk a bit about precision and scale, because those values between the brackets may not work the way we think they do.

Precision and Scale

Data types with decimal places are defined by what we call fixed precision and scale. Let’s look at an example:

123,456.789

In the above number, we see a six-digit number (ignoring the thousand separator) followed by a decimal point, and then a fraction represented by three decimal places. This number has a scale of 3 (the digits after the decimal point) and a precision of 9 (the digits for the entire value, on both sides of the decimal point). We would declare this value as DECIMAL(9,3).

This is confusing at first glance, because we have to declare it “backwards”, with the precision first, and then the scale. It may be easier to think of the precision in the same way we think of a character string’s length.
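
A quick sketch to make this concrete:

DECLARE @measurement AS DECIMAL(9,3);
SET @measurement = 123456.789; -- nine digits in total, three after the decimal point
SELECT @measurement;           -- returns 123456.789, stored exactly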

Date and time data types can also have decimal places, and SQL Server supports times accurate to the nearest 100 nanoseconds. The most accurate datetime is DATETIME2(7), where 7 decimal places are reserved for the time.

Before SQL Server 2008, we used DATETIME, which is only accurate to the nearest 3 milliseconds, and uses 8 bytes. A drop-in replacement for this is DATETIME2(3), using 3 decimal places, and accurate to the nearest millisecond. It only needs 7 bytes per column.
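
The difference in accuracy is easy to demonstrate, because DATETIME rounds to increments of roughly 3 milliseconds, which can even push a value into the next day:

SELECT CAST('2017-04-30 23:59:59.999' AS DATETIME);     -- 2017-05-01 00:00:00.000 (rounded up!)
SELECT CAST('2017-04-30 23:59:59.999' AS DATETIME2(3)); -- 2017-04-30 23:59:59.999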

Be mindful that, as higher precision and scale are required, a column’s storage requirement increases. Accuracy is a trade-off with disk space and memory, so we may be tempted to use floating point values everywhere.

However, in cases where accuracy is required, always stick to exact numerics. Financial calculations, for example, should always use DECIMAL and MONEY data types.

Exact Numerics

Exact Numerics are exact, because any value that is stored is the exact same value that is retrieved later. These are the most common types found in a database, and INT is the most prevalent.

Exact numerics are split up into integers (BIGINT, INT, SMALLINT, TINYINT, BIT) and decimals (NUMERIC, DECIMAL, MONEY, SMALLMONEY). Decimals have decimal places (defined by precision and scale), while integers do not.

Integers have fixed sizes (see table below), so we don’t need to specify a length when declaring this data type.

Type Bytes Range
BIGINT 8 bytes -2^63 to 2^63-1
INT 4 bytes -2^31 to 2^31-1
SMALLINT 2 bytes -2^15 to 2^15-1
TINYINT 1 byte 0 to 255
BIT 1 bit 0 to 1
  • BIT is often used for storing Boolean values, where 1 = True and 0 = False.
  • Yes, BIGINT can store numbers as large as 2 to the power of 63 minus 1. That’s 19 digits wide, with a value of 9,223,372,036,854,775,807, or 9.2 quintillion.
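
We can see these limits in action with a quick sketch:

DECLARE @age AS TINYINT;
SET @age = 255;      -- fine: the largest value a TINYINT can hold
SET @age = @age + 1; -- fails with an arithmetic overflow error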

Decimals may vary depending on the precision and scale, so we have to specify those in the declaration.

Type Bytes Range
DECIMAL, NUMERIC 5 to 17 bytes Depends on precision and scale. 38 digits is the longest possible precision.
  • DECIMAL and NUMERIC are synonyms and can be used interchangeably. Read more about this data type, and how precision and scale affects bytes used, here.

Although the MONEY and SMALLMONEY data types do have decimal places, they don’t require the precision and scale in the declaration, because they behave like DECIMAL(19,4) and DECIMAL(10,4) respectively. Think of these data types as a convenience more than anything.

Type Bytes Range
MONEY 8 bytes -922,337,203,685,477.5808 to 922,337,203,685,477.5807
SMALLMONEY 4 bytes -214,748.3648 to 214,748.3647

Approximate Numerics

Approximate Numerics mean that the value stored is only approximate. Floating point numbers would be classified as approximate numerics, and these comprise FLOAT and REAL.

Declaring a FLOAT takes an optional length, which represents the number of bits used to store the mantissa. REAL is a synonym of FLOAT(24).

The mantissa means the significant digits of a number in scientific notation, which is how floating point numbers are represented. The default is FLOAT(53). Generally, we stick to the defaults, and use REAL if we want to save space, forgoing some accuracy of the larger FLOAT(53).
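
Here’s a sketch of what “approximate” means in practice, following standard floating point behaviour:

-- 0.1 and 0.2 cannot be represented exactly in binary floating point,
-- so their sum is not exactly equal to 0.3
SELECT CASE
    WHEN CAST(0.1 AS FLOAT) + CAST(0.2 AS FLOAT) = CAST(0.3 AS FLOAT)
    THEN 'equal' ELSE 'not equal'
END; -- returns 'not equal'

-- The DECIMAL equivalent stores the values exactly
SELECT CASE
    WHEN CAST(0.1 AS DECIMAL(9,3)) + CAST(0.2 AS DECIMAL(9,3)) = CAST(0.3 AS DECIMAL(9,3))
    THEN 'equal' ELSE 'not equal'
END; -- returns 'equal'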

Type Bytes Range
FLOAT 4 or 8 bytes -1.79E+308 to -2.23E-308, 0 (zero),
and 2.23E-308 to 1.79E+308
REAL 4 bytes -3.40E+38 to -1.18E-38, 0 (zero),
and 1.18E-38 to 3.40E+38

Date and Time

Date and time data types are slightly more complex. For storing dates (with no time), we use DATE. We store times (with no dates) using TIME. For storing both date and time in the same column, we can use DATETIME2, DATETIME, or SMALLDATETIME. Finally, we can even store timezone-aware values comprising a date and time and timezone offset, using DATETIMEOFFSET.

DATETIME2, TIME, and DATETIMEOFFSET take a length in their declarations, otherwise they default to 7 (accurate to the nearest 100 nanoseconds).
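
A sketch of these types side by side, using the built-in SYSDATETIME() and SYSDATETIMEOFFSET() functions:

SELECT CAST(SYSDATETIME() AS DATE);    -- the date only
SELECT CAST(SYSDATETIME() AS TIME(3)); -- the time only, to the nearest millisecond
SELECT SYSDATETIME();                  -- date and time, as DATETIME2(7)
SELECT SYSDATETIMEOFFSET();            -- date and time, plus the timezone offset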

Character Strings

As we saw last week, characters can be fixed-length (CHAR) or variable-length (VARCHAR), and can support special Unicode character types (NCHAR and NVARCHAR respectively). Collation should also be taken into account.

Length can be 1 to 8000 for CHAR and VARCHAR, or 1 to 4000 for NCHAR and NVARCHAR. For storing values larger than that, see the Large Objects section below.

Binary Strings

Sometimes we want to store binary content in a database. This might be a JPEG image, a Word document, an SSL certificate file, or anything that could traditionally be saved on the file system. SQL Server provides the BINARY and VARBINARY data types for this (and IMAGE for backward compatibility).

Length can be 1 to 8000 for BINARY and VARBINARY. For storing values larger than that, see the Large Objects section below.

Large Objects

SQL Server 2005 introduced a new MAX length for several data types, including CHAR, NCHAR, VARCHAR, NVARCHAR, BINARY and VARBINARY.

(The XML data type uses MAX under the covers as well.)

This specification allows up to 2 GB of data to be stored in a column with that declared length. We should take care not to insert anywhere near 2 GB into these columns as a matter of course, but MAX provides greater flexibility when we need to store more than 8000 bytes in a column.
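
As a hypothetical sketch, a table for storing uploaded files might look like this:

CREATE TABLE [dbo].[Documents] (
    [DocumentID] INT NOT NULL IDENTITY(1,1),
    [FileName] NVARCHAR(255) NOT NULL,
    [Contents] VARBINARY(MAX) NOT NULL, -- each value can be up to 2 GB
    CONSTRAINT [PK_Documents] PRIMARY KEY CLUSTERED ([DocumentID] ASC)
);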

Other Data Types

SQL Server supports other types of data, which fall outside the scope of text and numerics. These include CURSOR, TABLE, XML, UNIQUEIDENTIFIER, TIMESTAMP (not to be confused with the date and time types), HIERARCHYID, SQL_VARIANT, and Spatial Types (GEOGRAPHY and GEOMETRY).

Next week, we will see how normalization and data types work together, now that we have a good overview of the different data types in a database.

If you have any thoughts or comments, please find me on Twitter at @bornsql.