My surname is NULL

Last Wednesday on Twitter, Abayomi Obawomiye (@SQLAmerica) wrote:

I just met someone with the last name NULL today. I really want to ask if they had issues with the last name but worried might be offensive

Abayomi goes on to say that the word “NULL” gets replaced with a space in any CSV files generated for reports, which got me thinking about this problem.

Then I realised I had already written about it. We make assumptions on a daily basis when it comes to working with data. In this case, Abayomi assumed that no one’s name could ever be “Null”.

Yes, I’ve made that assumption too.

This sparked a memory in the deep and rusty recesses of my brain. I remember reading an article about someone whose last name is Null. It took me just a few seconds to find the article on Wired Magazine’s website, and the pullout phrase is right in the headline:

Hello, I’m Mr. Null. My Name Makes Me Invisible to Computers

It turns out that Abayomi is not alone with this chance meeting. How many of us consider the possibility that legitimate data in our databases could match reserved keywords?

The author of that article, Christopher Null, has an amusing workaround:

My usual trick is to simply add a period to my name: “Null.” This not only gets around many “null” error blocks, it also adds a sense of finality to my birthright.

My goal is to build systems that are easier to use and that adapt to the consumer, so this is clearly problematic. We shouldn’t be forcing Christopher to change his own name to fit our database model.

How would I go about dealing with this problem? Clearly, when exporting data to CSV or generating extracts for other third-party use, this problem has to be solved.

Delimiters are Good

The easiest way I can think of is to delimit values correctly. When exporting to CSV format, we would ensure that all strings are surrounded by quotation marks. This way, Christopher Null’s record would show up as follows (assuming last name, first name ordering):

"Null","Christopher"

Then the parsing process on the reporting side would import that record into whichever system consumes it, without stumbling over the value.
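For example, a T-SQL extract could apply the quoting itself. This is only a sketch, using hypothetical table and column names, but it shows the idea:

-- Quote every string column when building the CSV line.
-- Note: QUOTENAME doubles any embedded quotation marks (which is what CSV expects),
-- but only accepts input up to 128 characters.
SELECT QUOTENAME([LastName], '"') + N',' + QUOTENAME([FirstName], '"') AS [CsvLine]
FROM [dbo].[Customer];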

Another issue raised by Christopher in his Wired article is that his email address is rejected outright by Bank of America because it contains the word null before the @ sign.

First and Last Name columns are bad

I’m going to call myself out on the above example and say that assuming names can fit neatly into two columns is bad. Firstly, Cher and Madonna would complain.

Secondly, Vice Admiral Horatio Nelson, 1st Viscount Nelson, 1st Duke of Bronté, would run out of space before his name could be entered correctly.

This doesn’t even begin to address names that are outside of our mainly Western perspective. In South Africa, I saw names that few developers in North America would consider, making use of all manner of punctuation (including spaces).

My recommendation here is to have a 250-character NVARCHAR column called FullName and leave it at that. If you really want your mail merge software to send “personalised” emails or letters, add an additional PreferredName column, and make sure it’s properly delimited when exporting.

Christopher Null wouldn’t have to bat an eyelid here. It would solve his problem.
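As a minimal sketch of that recommendation (the table and column names here are hypothetical):

CREATE TABLE [dbo].[Person] (
    [PersonID] INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    -- One column for the whole name, however the person writes it
    [FullName] NVARCHAR(250) NOT NULL,
    -- Only needed if mail merge software must send "personalised" messages
    [PreferredName] NVARCHAR(250) NULL
);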

(While I’m on this matter, don’t ask for Gender or Sex. You don’t need it. If you do, you’ll know how to ask properly.)

Email Validation is Stupid

My third rule for ensuring that Nulls can exist in this world with Wests and Obawomiyes is to stop validating email addresses!

Firstly, most developers use a regular expression to do it, which is wrong and terrible and slow, and (clearly) imperfect.

This is a particular bugbear of mine, since I have one character before the @ in my personal email address. Many systems fail when trying to validate it, including WestJet. My own workaround is to create a westjet@ alias so that I can receive email from WestJet.

The best way to validate an email address is to send an email to that address with a link for the person to click on. If they don’t click the link to close the loop, the email address isn’t valid, and your well-designed system won’t allow them to continue to the next step.
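One way to model that loop in the database is a confirmation token that is only stamped when the link is clicked. This is just a sketch, with hypothetical table and column names:

CREATE TABLE [dbo].[EmailConfirmation] (
    [EmailAddress] NVARCHAR(320) NOT NULL, -- store whatever the person typed
    [Token] UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
    [RequestedAt] DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME(),
    [ConfirmedAt] DATETIME2(3) NULL -- stamped only when the link is clicked
);

-- When the link is clicked, close the loop using the token embedded in it.
DECLARE @Token UNIQUEIDENTIFIER = '00000000-0000-0000-0000-000000000000'; -- placeholder

UPDATE [dbo].[EmailConfirmation]
SET [ConfirmedAt] = SYSUTCDATETIME()
WHERE [Token] = @Token
  AND [ConfirmedAt] IS NULL;

The address is only treated as valid once ConfirmedAt has a value.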

Summary

Three simple recommendations could help the Nulls find visibility again and ensure data inserted into a system is not modified unnecessarily.

For more reading on this and similar problems, I recommend visiting Bobby Tables.

If you have any experience in this area, please feel free to contact me on Twitter, at @bornsql.

A dalliance with distance

During a recent engagement, I was tasked with making a query faster. Since I signed an NDA (Non-Disclosure Agreement), I can’t go into detail, but it made extensive use of the STDistance() function of the GEOGRAPHY data type.

The query is being called many times a second, and owing to the way the application was designed, is effectively running in a cursor. For each row in a table that contains GPS coordinates, it has to calculate how far the principal GPS coordinates are from that row’s latitude and longitude.

As you can imagine, even with only a few hundred rows, this does not scale well. The database had hundreds of thousands of rows that needed to be calculated each time the query ran because these values are constantly changing.

The data was then being ordered on the STDistance result so that the nearest neighbour showed up first.

After some investigation, I discovered that the spatial index on the coordinates column was not set up correctly. However, we were unable to schedule a maintenance window to rebuild that index. Therefore, I was tasked to make the query faster without modifying any indexes.

To begin with, removing the ORDER BY clause made the query run much more quickly, even though it was running in a loop. After removing the looping structure and refactoring the query to be set based, it ran even more quickly. No surprises there, but the result set was now much larger.

However, it was still causing blocking on the table where the coordinates were being stored, and the returned data was unsorted.

Enter the great-circle distance. In principle, the distance between any two points on the surface of a sphere can be calculated using a simple mathematical formula (known as the haversine formula). If there’s one thing I know about computers, it’s that they evaluate mathematical formulas really well.

The haversine formula is not simple to express in a T-SQL query. Instead, we can use the spherical law of cosines, which is not as accurate (there is a maximum drift of around 2 metres, or a 0.3% margin of error).

(cos c = cos a cos b + sin a sin b cos C)

The advantage of this cosine law is that it’s one line of code, which means it can be used as a drop-in replacement for STDistance.

The customer agreed that for the purpose of this query, 0.3% was well within an acceptable error range. I helped the lead developer rewrite the query to replace the STDistance() call with the cosine formula, which improved performance by a factor of 1,000 and did not need the spatial index at all.

In purely T-SQL terms, this looks as follows ([Latitude] and [Longitude] are columns in the table I’m querying):

-- Latitude of source
DECLARE @Latitude FLOAT(53) = 34.09833
-- Longitude of source
DECLARE @Longitude FLOAT(53) = -118.32583
-- Mean radius of the earth, in miles
DECLARE @Radius FLOAT(53) = 3959

SELECT [ColPK], [Latitude], [Longitude],
ACOS(COS(RADIANS(90 - @Latitude)) * COS(RADIANS(90 - [Latitude]))
+ SIN(RADIANS(90 - @Latitude)) * SIN(RADIANS(90 - [Latitude]))
* COS(RADIANS(@Longitude - [Longitude]))) * @Radius AS [Distance]
FROM [InterestingTable]

(Source: http://gis.stackexchange.com/a/40929)

Now of course this can be sorted and filtered as required. Because the distance is expressed in miles, we can restrict the results to only those values that fall within a specific radius of the source, making the query run even faster.
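For instance, re-using the variables and table names from the query above, a CTE lets the WHERE and ORDER BY clauses reference the calculated distance (a column alias cannot be used directly in the WHERE clause). A sketch:

;WITH [Distances] AS (
    SELECT [ColPK], [Latitude], [Longitude],
    ACOS(COS(RADIANS(90 - @Latitude)) * COS(RADIANS(90 - [Latitude]))
    + SIN(RADIANS(90 - @Latitude)) * SIN(RADIANS(90 - [Latitude]))
    * COS(RADIANS(@Longitude - [Longitude]))) * @Radius AS [Distance]
    FROM [InterestingTable]
)
SELECT [ColPK], [Latitude], [Longitude], [Distance]
FROM [Distances]
WHERE [Distance] <= 10.0 -- only rows within a 10-mile radius of the source
ORDER BY [Distance];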

If you have any neat geography data things to share, find me on Twitter at @bornsql.

Data Efficiency in SQL Server: DECIMAL

This is part two of a short series of posts about how I assign efficient data types when designing a new table or database.

Use less space with BIGINT

Last week, I spoke about a more efficient DATETIME than DATETIME. This week I have a more efficient DECIMAL than DECIMAL.

If you’re planning to store a long number that will never have a fraction, and you were considering using DECIMAL, I’m going to ask you to think about BIGINT first.

Granted, this is a very specific use-case, and requires some forethought, but that’s our job: to think ahead.

Here’s the scenario where I opted for a BIGINT over a DECIMAL: IMEI numbers.

According to Wikipedia, the International Mobile Station Equipment Identity (that long number on your mobile device that identifies it uniquely—well, mostly) is usually from 14 to 16 digits in length, and may be as short as 8 digits in length.

A DECIMAL data definition requires two values: precision and scale.

Let’s say you want to store the value 123,456.789. The number of digits in total (both to the left and right of the decimal point) is nine. Thus, the precision is 9.

The scale is everything to the right of the decimal point, which in this case is 3.

In other words, we would define this value as DECIMAL(9,3).
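As a quick illustration (the value is arbitrary):

-- Nine digits in total, three of them after the decimal point
DECLARE @Sample DECIMAL(9, 3) = 123456.789;

SELECT @Sample AS [Sample], DATALENGTH(@Sample) AS [BytesUsed]; -- returns 5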

A precision of 1–9 digits requires only 5 bytes of storage, which makes it far more efficient than a BIGINT: a BIGINT needs the full 8 bytes and cannot store the fraction anyway.

However, as the precision increases, so do the storage requirements. A precision of 10–19 digits requires 9 bytes (an 80% jump in storage), and this is the sweet spot where a BIGINT becomes an effective alternative.

If you do not need to store a fraction, as in the case of an IMEI number, consider BIGINT. 8 bytes will safely store a value up to 19 digits in length (for a maximum value of 2⁶³-1).
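To see the difference, compare the bytes used by each type (the 15-digit value here is made up):

DECLARE @AsDecimal DECIMAL(15, 0) = 490154203237518;
DECLARE @AsBigint BIGINT = 490154203237518;

SELECT DATALENGTH(@AsDecimal) AS [DecimalBytes], -- 9 bytes (precision 10-19)
       DATALENGTH(@AsBigint) AS [BigintBytes];   -- 8 bytes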

So if you’re storing IMEI numbers for a country’s worth of mobile devices, say 250 million rows, that one byte per row will save you roughly 238MB in memory, queries, storage, indexing, backups, and so on.

I think that’s worth thinking about.

Hit me up on Twitter, at @bornsql, to tell me how you’ve made efficient data design choices.

Data Efficiency in SQL Server: DATETIME

This is the first in a short series of posts about how I assign efficient data types when designing a new table or database.

Use less space with DATETIME2

We all know that the DATETIME data type uses 8 bytes of space to store the date and time, to an accuracy of a paltry 3 milliseconds.

This used to cause no end of drama in the days before SQL Server 2008 because milliseconds were rounded to end in 0, 3 or 7. When trying to calculate the last possible time in a day, you had to work with some crazy values like 2016-03-15 23:59:59.997.

Fortunately, SQL Server 2008 introduced DATETIME2, which put paid to that drama. We now have a precision of up to 7 places after the decimal, and it still only uses a maximum of 8 bytes!

For example, we can now store a value of 2016-03-15 23:59:59.9999999, which is mind-bogglingly close to midnight, and not worry about insane values ending in a 7.

Do you really need that precision, though? How about a way to actually use less storage and store more accurate data?

A precision of 0–2 places (after the decimal) requires only 6 bytes, 3–4 places requires 7 bytes, and 5–7 places requires the full 8 bytes. That means we can save ourselves a whole byte, without losing any precision, by using DATETIME2(3) as a drop-in replacement for DATETIME.
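A quick way to see this is to compare the bytes used by each variable (the timestamp is arbitrary):

DECLARE @OldStyle DATETIME = '2016-03-15 23:59:59.997';
DECLARE @NewStyle DATETIME2(3) = '2016-03-15 23:59:59.997';

SELECT DATALENGTH(@OldStyle) AS [DatetimeBytes],  -- 8 bytes
       DATALENGTH(@NewStyle) AS [Datetime2Bytes]; -- 7 bytes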

We get precision down to the millisecond (even more accurate than before) and shave off one byte for every row in that column. In a table with ten million rows, that’s 9.5MB of space we no longer have to store, or query, or index, or back up.

What data efficiency method have you used in production? Let me know on Twitter at @bornsql.

Lazy Loading and Tries

This post has nothing to do with SQL Server, but if you like performance tuning, stick around.

I learn technology by using it, pushing it to its limits, finding out how it breaks. I jokingly refer to myself as a living edge case because I will do something with your product that you might not have expected.

(If you’re looking for a beta tester, I’m definitely your man.)

By the same token, I write a lot of applications (in C#, my language of choice) to test theories and learn new things.

I have created a word puzzle game (no, you’ll never see it in the wild) that has as its list of permitted words a dictionary of approximately 300,000 English words. The list is a text file in ANSI format and smaller than 3MB.

The puzzle game works along the lines of the Scrabble® board game, with a set of random letters assigned to each player.

The trick for me was to limit the possible words from the main dictionary, based on the current player’s letters, to validate against. Unfortunately, even holding the full dictionary in memory in a List<string>() object was very (very) slow to filter. It was taking up to 7 seconds each time the letters changed.

I wrote to my friend, André van der Merwe, to ask him for help. My goal was to find all possible words with each letter combination, in the fastest possible way, ordered by longest to shortest. Performance is a feature.

André suggested I use a trie to hold my word list. This is essentially how autocomplete algorithms work: each letter is a node in the tree, and every word that starts with the same sequence of letters descends from the same path. The computer reduces the list of possible words by following the path down the tree.

(Image: an example of a trie)

(I also call this a radix tree, but André correctly informed me that it’s not quite the same thing. Radix trees are more compact.)

A trie would make my search for words extremely fast, because each node in a trie carries a Boolean value (true or false) indicating whether it is the final letter of a word.

Unfortunately, every time I switched to a new player in the game, the entire word list had to be loaded into memory to be filtered against. For four players, this needed 250MB of RAM, because a trie uses an object for every letter, and each player’s copy of the word list was consuming over 60MB.

I work with data a lot. This is after all a SQL Server blog. I realised I didn’t need the entire dictionary in memory, ever. My players get up to 12 letters to work with, so I could eliminate words longer than that, which meant I could filter the word list before even knowing the player’s letters. What if I filtered the length and the player’s letters at the same time as building the possible word list?

Clearly this was a problem for lazy loading, a design pattern that says you should only load data into memory when you need it.

My solution was to read the word list off persisted storage (the hard drive) one word at a time, and only if it had 12 letters or fewer, and could be made up from the letters in the player’s set, would it be loaded into that player’s trie.

Problem solved! Each player’s trie loads in under 30ms off SSD storage, and around 70ms if the word list is read from a spinning hard drive. Even better, the memory footprint for each trie is only 12MB. This is still much larger than the 2.7MB of the List<string>() object, but a good trade-off for the performance.

For reference, I eventually used this C# implementation, which André and I adapted.

What coding performance tricks have you used lately? Let me know on Twitter, at @bornsql.

Update to Azure Blob Storage Sync and Restore

Blob Storage Sync tool updated

During a SQL Server migration this month, I found some inconsistencies in my Azure Blob Storage Sync tool, so I made several improvements, and fixed an outstanding bug.

As you know, it relies on the naming convention provided in Ola Hallengren’s Maintenance Solution and comes in two parts: the AzureBlobStorageSync command-line application, and the AzureBlobStorageRestore command-line application.

New Features

I realised that it was not possible, using this tool, to download every file from the Blob Storage Container in one go. The code only downloaded the files necessary to perform the latest restore for a single database.

To resolve this, and allow all files to be downloaded from Blob Storage, I have added a new configuration key called DownloadFilesFromAzure, which takes True or False values.

Another new feature is an explicit option to upload all local files to Blob Storage during a sync. Previously, it was implied that all local files in the LocalPath should be uploaded, but you may not want to do that. This is implemented as configuration key CopyFilesToAzure, which takes True or False values.

Deleting Files

There are now two ways to delete files from a Blob Storage Container:

  • Files that do not match the source, which must be deleted from the target;
  • Files that are no longer needed on the target.

The first option is a typical synchronisation feature and was implicit in the previous version. It is now implemented using the configuration key DeleteMissingFilesFromAzure, which takes a True or False value. If it is set to True, files that do not exist on the local drive will be deleted from Blob Storage.

The second option is for deleting files that match a certain string in the file name. This is handy for server migrations where the file names generated by Ola’s backup script contain the old server name. While the backup script can perform cleanup tasks based on timestamps, it will ignore files that have a different server name, and you might be left with orphaned files long after the backup retention window has passed.

This is controlled by two configuration keys: DeleteExplicitFilesFromAzure, which takes True or False values, and ExplicitFilesToDeleteMatchingString, which contains the matching string (for example, the old server name). Any file with a name matching that string will be deleted. This particular feature only works with one string at a time, so you will have to run it more than once if you need to delete file names that match other strings.
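For reference, the deletion-related keys might look something like this in the tool’s configuration file. This is only an illustrative sketch that assumes a standard .NET appSettings section; check the sample configuration in the repository for the actual format and the remaining keys.

<!-- Illustrative only: assumes a standard .NET appSettings section -->
<appSettings>
  <add key="DeleteMissingFilesFromAzure" value="False" />
  <add key="DeleteExplicitFilesFromAzure" value="True" />
  <add key="ExplicitFilesToDeleteMatchingString" value="OLDSERVERNAME" />
</appSettings>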

Fixes

Yes, the only logged issue in the repository has been fixed! I now use sp_executesql instead of EXEC for the T-SQL portion of the restore tool. This was probably the easiest thing to fix.
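The pattern looks roughly like this. It is only an illustration of the change, not the tool’s actual code, and the database and file names are made up:

-- Build the restore command dynamically, then execute it with sp_executesql
DECLARE @DatabaseName SYSNAME = N'SampleDatabase';
DECLARE @BackupFile NVARCHAR(260) = N'D:\Backup\SampleDatabase_FULL.bak';

DECLARE @sql NVARCHAR(MAX) = N'RESTORE DATABASE ' + QUOTENAME(@DatabaseName)
    + N' FROM DISK = ' + QUOTENAME(@BackupFile, '''') + N' WITH NORECOVERY;';

EXEC sys.sp_executesql @sql;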

A more critical fix, and the main reason for this blog post and tool update, is to do with downloading files.

Imagine an all-too-common scenario where a download fails before it is complete. In an emergency, the last thing you need is your Restore tool failing. Whereas before I was simply checking the name of the file, I now check the file size as well. If the file sizes do not match, the file will be downloaded from Blob Storage again.

Files now download from smallest to largest in size. This is a major benefit if you have a lot of small transaction log files.

Notes

Reading the list of files from Blob Storage takes about ten seconds for 2,500 files, before parsing can begin. This is not a lot of time, but it’s something to keep in mind.

Feature Requests and Bug Reports

If you have any feature requests, or have found a bug, please log that on the GitHub repository. Even better, if you know C#, you can add your own features and fix your own bugs, submit a pull request, and I’ll merge your code into the tool.

Questions, comments or complaints? Twitter is the place: @bornsql is where you’ll find me.

Survey: Light or Dark

This week I am continuing with the simple survey series. Today my question is, do you prefer using a dark or light background when writing code?

I like a light background with dark text (usually black text on a white background), because I find it difficult to see when I change between different applications, and even to see clearly when I look away from the screen. Visual artefacts suck.

How about you? Sound off with me on Twitter: @bornsql.

Survey: Tabs or Spaces

This is my second survey for the month of December.

Last week I asked about join predicate order. This week, it’s a simple question:

Do you prefer tabs or spaces to indent your T-SQL, and why?

I prefer to use tabs, because that’s how I learnt it in software development. Not that you asked, but my default tab size is 4 spaces.

Being able to adjust the tab size makes using spaces redundant, for me anyway.

Hit me up on Twitter (@bornsql) to have your say.

Pre-Announcement Announcement

On Saturday, 27 June 2015, I will be presenting at SQLSaturday #407 in Vancouver BC. The session is titled “Restoring a SQL Server Database from Azure Blob Storage”.

I will be releasing some new content and code at the same time as my talk, which will be available from this site, assuming WordPress plays along.

Once the content is available, you’ll understand why I’ve been so quiet on the blog.

Stay tuned.