Canada Day 2020 - Born SQL

It’s Canada Day here, the commemoration of the country’s confederation in 1867. Given events over the last few months such as the worldwide pandemic, followed by a number of politically-motivated movements for change across the United States of America and other countries, it’s a good idea to reflect on what database design might look like in a more inclusive and diverse world.

I’ll start with just three seemingly minor fields: name, address, and gender. As previously discussed on this blog, surnames can be fraught with problems. As well, given the awareness of gender diversity these days, it requires that we database designers should move with the times and be more inclusive.

Names

I was born and raised in a country with a long history of colonialism, as I’m sure are many of my readers. There is a history beyond that which was taught to us in school, beyond the arrival of colonialists at our shores. In the case of South Africa for example, colonial history started with Jan van Riebeeck’s arrival at the Cape of Good Hope (later Cape Town), with a minor mention of Portuguese explorer Vasco da Gama before him who discovered a route to India over the ocean.

Van Riebeeck’s surname was van Riebeeck. Da Gama’s surname was da Gama. In the SQL Server database of imperialism, this would be a simple ASCII column — probably VARCHAR(150) — with no need for Unicode support. Depending on which empire you lived in you might have a different code page for Dutch or Portuguese, but that’s pretty much it.

But there are people with names that don’t follow this first-name last-name paradigm. There are people with mononyms (one-word names). There are people with one-character or two-character surnames. There are whole countries that put their family name (what we might call a surname) before their first name, not to mention those of us who have inherited names but go by a preferred name in day-to-day life. In Canada, some indigenous people might put their colonial name first followed by their indigenous name, and in some cases a description of what it means.

The point is that the first-name last-name way of storing names is not inclusive, it’s obsolete. The way we store data should allow for a more diverse audience. So I propose when designing your next database that stores people’s names, do it as follows:

Column	Data type	Description
FullName	`NVARCHAR(512)`	The person’s full name
PreferredName	`NVARCHAR(256)`	How they wish to be addressed (in emails, when logging in, and so on)

This solves several problems:

People with mononyms don’t have to fight with a required surname field.
Special characters are fully supported, including apostrophes, grave accents, and other diacriticals.
Email and other interactions can use the preferred name without having to guess where the first name ends and the last name begins.
Names are stored as variable length already, so allowing people the opportunity to use more space if needed is just good practice.
Using row and page compression will reduce the storage requirements for Unicode strings.

Addresses

Recently I was discussing an issue with a colleague around the limit of five lines for a physical address that was engineered in a database we designed 7 years ago. It works most of the time without any issue, but in several special cases we have had to figure out how to parse line breaks and commas.

I used to work for a company in South Africa that distributed high school content via VHS, and we had some addresses in our database like “2 kilometres past the big oak tree.” On the other side of the scale, we had office park addresses with a unit number, in a building, in a block, on a street, in a suburb, in a city, in a province, plus the postal code. In cases like this, the best thing to do is still have separate NVARCHAR(256) columns for the city, province, and postal code, but a single Unicode field for the main body of the address (perhaps NVARCHAR(1024) for example). People can separate values by commas when they enter their data.

Addresses are messy at the best of times, so parsing commas might still be necessary, but forcing them into a limited set of lines is messier.

Gender

This one is easy. Stop storing gender. If you need a gender for medical or legal reasons, okay, but make it NVARCHAR(256). Done and done.

Leave your thoughts in the comments below.

Photo by Maarten van den Heuvel on Unsplash.