Relational databases aren’t the problem

Content warning: This is the first of two rebuttal essays, about why someone is wrong on the Internet. It is no doubt biased. It might go into technical detail. Parts of it may be wrong. It may contain flippant remarks and editorialising. There are links to external references that may distract the reader.


I am a published author. Part of the fun of being an author is promoting the books you’ve written. To avoid any implication that I might be promoting my own book, I’m going to remove all doubt and get that out of the way here: please buy my book (Amazon.com affiliate link).

Why is this relevant?

In June 2018, a software architect named Lance Gutteridge wrote a number of opinion pieces about what he feels is wrong with relational databases. In every single case, there is a link or reference to a book the author has written, titled “Avoiding IT Disasters”. Since I have not purchased nor read the book, I cannot confirm that what’s referenced actually appears in the book.

Now of course I want to mention bias. When I studied journalism at journalism school (as opposed to, say, computer science school), I learned about bias. The idea is that most writing is subjective. To be objective about a topic makes for very dry, very dull reading.

Bias is important because it spices things up. Now, a piece of writing takes a position. A side has been picked. The writer is making a point in order to achieve some goal. Perhaps that goal is to sway public opinion. Perhaps it is intended to sell advertising. Perhaps it is intended to sell books.

When an opinion takes an extreme point of view, as is the case in the opinion pieces to which I am about to respond, the writer must take great care to support their convictions — especially in the face of mountains of evidence to the contrary — lest they find themselves the subject of scorn.

To that end, this essay is intended to address the writings of a person who — in my opinion — is provably wrong, and whose advice should not be heeded when it comes to relational databases, a topic near and dear to my heart. It is biased, because I work with relational databases. I have also had some experience as a software developer and software architect. While Gutteridge is a Ph.D., I have no such degree, however I am a Chief Technology Officer. I have designed and written software for at least one large financial institution (the actual number is larger, but this isn’t a measuring contest). I also worked with PeopleSoft (an Enterprise Resource Planning tool, or ERP) back before Oracle bought them.

Rebuttal

This essay counters the opinions of Lance Gutteridge in a number of ways. Where technical errors have been made, I will present corrections. Where opinions have been made, I will present my own counter-argument.

There is a common term you may have seen online or in print, usually regarding technology: Fear, Uncertainty, and Doubt, or FUD. It is my belief that Gutteridge is peddling FUD in order to increase sales of his book, and possibly increase revenue for his business. That is the position (or bias) of this essay.

If we review just the titles of his opinion pieces, we can see a general negative trend:

  • What I’m Telling Business People About Why Relational Databases Are So Bad
  • Databases: The Future Buggy Whips of Software
  • Doing Without Databases in the 21st Century
  • Risking Other People’s Data By Using Relational Databases
  • The Courtroom Trial of SQL: Transcript
  • Why Are Relational Databases So Bad But So Popular?

These are just the pieces to which I am responding. Exploring Gutteridge’s other pieces, one will see similar negative sentiment. A casual observer might say that Gutteridge has an axe to grind — or a bone to pick — with relational databases. It raises a number of questions, and the responses by the general public to some of these pieces are quite strong. Clearly Gutteridge has touched a nerve with the relational database crowd. After all, you’re reading this essay.

There is one question I must pose in light of this negativity. Is Gutteridge trolling us? 2018 has proved — if nothing else — that anything is possible in the world today, especially when it comes to “fake news” and social media pile-ons. Perhaps his series of posts is a way to garner attention. Even if this is an elaborate hoax, it falls comfortably under the “increase book sales” reason mentioned previously.

“Why Relational Databases Are So Bad”

I will begin with the first piece, dated 1 June 2018, titled What I’m Telling Business People About Why Relational Databases Are So Bad.

The image Gutteridge chose for this piece is a screenshot of SQL Server’s DBCC CHECKDB output, showing corruption in a database, which is an interesting choice considering that this image is demonstrating a relational database’s best feature, namely maintaining the consistency of data.

Getting into the body of the piece, the opening paragraph begins with book promotion (just as I did in the beginning of this essay), but also notes that “relational databases are the cause of so many problems in enterprise systems”. While Gutteridge enumerates some issues in the piece, they cannot be directly attributed to specific flaws in the relational model, as I will demonstrate.

As we read further, there is a short history lesson from Gutteridge’s point of view, and finally a comment about how the word “relational” is mathematical in meaning, and has nothing to do with how human beings see relationships between things or people.

In his own words:

This is completely removed from our business-centric concept of a relationship, where there has to be some kind of a connection. If you had a group of pairs of objects and someone told you they were related, you would look for some kind of connection.

The relational model explicitly states that there must be a relationship, or “connection”, between objects in a database (represented as tables), and in the tables themselves. In fact, it’s the leading principle of relational design. Let’s consider the example of a relational database used by a school. There might be a table representing members of staff at the school, and another table representing classrooms. The relational model expects that there is a logical, recognisable, appropriate connection between the columns and rows in each table, as well as a relationship between those two tables where members of staff are assigned to a classroom. The implementation of this relationship can take different forms depending on normalization, but if there is no connection, it’s simply not a relational database.

Then, this pronouncement:

They can be considered a relationship even if they are just paired by chance.

No they cannot. If this is happening in your relational database, you’re designing it wrong. Using the industry standard of third-normal form, this should never happen. Structured data is organised into logical segments for a reason. If grouping happens by chance, it is not structured data, and you should immediately hire a data professional.

Telegrams and user inputs

Once again we step back in history as Gutteridge discusses the Anglo-Boer War. Specifically, the Second Anglo-Boer War, where the English were routed by the Boers. I was born and raised in South Africa, so I’m intimately familiar with this particular war.

Gutteridge takes exception with the fact that telegrams use the word “STOP” to spell out punctuation in order to avoid ambiguity where the text might be marked by dirt, and that somehow this is akin to relational databases using a stream of readable text to perform operations, separated by a stop word.

Firstly, I think he is confused by the GO batch operator, common in SQL Server, which is not even legal Transact-SQL (T-SQL). GO is for the database client (and for human operators) to make code more legible.

But as we read further, his main complaint is a concept called SQL injection. According to Gutteridge, relational databases are bad because SQL injection is possible.

Yes, SQL injection is a real thing, and is the result of permitting unsanitised user input in an application. That’s not the relational database’s fault. Like any computer software, a relational database will do what it is told, even if the instructions are in a language that looks a lot like English and contains stop words.

SQL injection is most often caused by inexperienced software developers, not database administrators. Any poorly-designed interface that allows unsanitised input should never be put into a production environment. Period. If a Chief Technology Officer and Enterprise Software Architect is designing systems that allow SQL injection, they have no place designing systems. All data that is entered by a user should be validated and sanitised before it gets to the database. This is not hard to do, and most programming languages do it. It even has a name: parameterization. What this means at a technical level, is that the user input is transformed into harmless text (“encapsulated”, using QUOTENAME in SQL Server, for example), instead of being executed as a command by the relational database.

The wealth of mitigations against SQL injection are such that Troy Hunt has produced a training course about it called OWASP Top 10 for .NET developers part 1: Injection.

Remember that this writing to the database is being done by a task that has to have sufficient rights to update the database.

This statement baffles me. On the one hand, Gutteridge claims that relational databases are bad because SQL injection is possible. On the other hand, you need to have sufficient rights to make changes to the database. Which is it? SQL Server (and other relational database management systems) can assign permissions to every single object in the database. Users might have read-only access to certain tables. Updates to the database might be performed exclusively through stored procedures, with the underlying tables safely protected. All data and table modifications can be tracked and audited.

Gutteridge also fails to mention that unsanitised user input can affect compiled application code as well. All software must operate in computer memory. While we think of data and the application code itself as separate, the boundaries can becomes hazy in memory.

Consider the example of a Windows application with a simple textbox. Assume that there is no validation on the input, and no limit on the size of data that can be pasted in there. Now imagine what would happen to the application (and the computer memory) if I pasted the complete works of William Shakespeare into that box. What will the application do? If it is not written well, the application might crash, and leave all that data in memory (this is one form of what’s called a memory leak). I could restart the application and do the same thing, eventually causing the computer to run out of memory and possibly crash.

A more nefarious case would be entering specific non-printable characters (invisible to us, but recognisable by the computer’s memory) into the textbox, followed by malicious code. Let’s say it’s a C++ application and I place a null terminator character \0 in there, followed by some machine code. If the code is crafted correctly, the application will accept the input, terminate the string at the null terminator, then process the machine code that we pasted after it as though it was executable code. While I’ve simplified the example, this is a common attack vector called a buffer overflow.

This is just a terrible architecture and it is responsible for billions of dollars in damage to organizations around the world.

The lesson is not that SQL injection makes relational databases bad. The lesson is that software developers should validate and sanitise user inputs, not trust any data provided by users, and don’t let application code make changes directly to the database without applying appropriate permissions.

Don’t Repeat Yourself

Gutteridge’s next complaint is that the database is duplicating effort already defined in the software (the application that runs in front of the relational database). It’s not quite clear what the problem is from the text, but I think he’s complaining about normalisation. Normalised data is the epitome of “Don’t Repeat Yourself” (the DRY principle to which Gutteridge refers), so this complaint is contradictory.

As my regular readers know, normalisation is the process at design time of breaking up of data into its smallest possible parts, so that the database is more efficient in terms of storing data and ensuring consistency (I refer to this step as ensuring a single version of the truth). At the design stage, this adds complexity when writing queries to retrieve data because it requires table joins, but that is entirely the point of relational databases.

I will be the first to admit that this process is complex, and it can affect performance if designed incorrectly, but it happens once. Then you create stored procedures to keep these queries readily available for the application to use. Using Gutteridge’s example, if you need the Timesheet data as well as the Employee data, you can use that particular stored procedure you wrote to bring it all back as needed, without having to do two trips to the database. Add in a caching layer for data that doesn’t change as often, and your application will be even faster.

Normalisation can be tricky and there is a point where you can overnormalise a database, so it must be balanced against efficiencies in storage, data consistency, reduced duplication of data, and ease of querying the data (a step often overlooked, apparently). When an application needs to write data back to the tables, it must necessarily break it up so that each table is updated correctly. Yes, that’s complicated, but it’s no harder than writing enterprise application code. A seasoned database developer would not bat an eyelid at this task. Also, an Enterprise Software Architect would have reusable code that handles a lot of the heavy lifting already.

I concur that normalisation is tricky, but this does not make it bad. Gutteridge goes on to write that data can get out of step because the software is somehow not communicating with the database correctly, and I’m left wondering whether he has ever even used a relational database.

Data Types

I have written many words on my blog about data types. One of the first principles of relational databases is that data needs to be stored correctly, to ensure consistency and trustworthiness.

Gutteridge again:

[T]he data in a relational database is stored in ways more in keeping with a 1980s programming language than with a modern, object-oriented language. All of the data in today’s modern, object-oriented languages have to be encoded into these primitive data types.

Firstly, the original SQL standard is older than that, and that’s a good thing. Like music theory, one of the great things about established computing standards is that we don’t need to reinvent the wheel. Data must be encoded into various data types according to their classification. Let’s check out some of the “primitive data types” supported by Microsoft SQL Server 2017:

    • BIGINT: a whole number ranging from -2^63 to 2^63 – 1.
    • FLOAT: really big and really small floating point numbers.
    • DECIMAL: Accurate decimal values with custom length and scale.
    • DATETIME2 / DATE / TIME: dates and times, with related data types for time zone support.
    • CHAR / VARCHAR: ASCII text, common for Western alphabets that only need 1 byte per character.
    • NCHAR / NVARCHAR: Unicode text, common for alphabets that need more than 1 byte per character.
    • VARBINARY: blobs or images, which, let’s face it, you shouldn’t really be storing in a database.
    • XML: data that can be expressed in XML format.
    • And even more boring stuff like INT, BIT, TINYINT, UNIQUEIDENTIFIER, etc.

Looking at that list, I’m struggling to see a need for a data type that isn’t already covered. Even so, let’s humour his argument and consider object-oriented languages. I happen to be familiar with C#, a modern, open-source, object-oriented programming language originally created at the turn of the century with built-in memory management, and both dynamic and static typing. In C# you create complex objects called classes (or structs if you like to abuse memory).

These objects are made up of individual data types, which are in most cases called value types (or “primitives”), such as int. You could have reference types as well, like string, DateTime, Guid, Xml, and even other classes.

As a thought experiment, let’s define a class in C#, and see how the underlying SQL Server table would look, bearing in mind that SQL is more than twice as old as C#.

C# class

SQL table

What was his point again? Even if we have classes within classes, or make use of object inheritance, this can be transposed into SQL objects using lightweight ORM (object-relational mapping) tools like Dapper or Entity Framework Core. In fact, if an enterprise software development company is not using an ORM, I have questions.

Complexity

Gutteridge, who is the CTO and software architect for an enterprise software development company, writing enterprise resource planning software, has one final comment about why relational databases are bad:

Take an older code base with a lot of additions and changes by different people over the years, then try to customize it to a new situation, and throw in all this added complexity of a relational database, and you are putting the project seriously at risk.

He almost got it, but let the relational database take the blame. Writing software properly is hard. Designing a good relational database is hard. Maintaining older systems is hard. Planning is key in any software development and maintenance project, and storing data is just one aspect of that.

Summary

Instead of railing against relational databases for all the wrong reasons, we should instead take a closer look at how relational databases have made software applications easier to write by abstracting away many issues around performance and consistency. As with any complex endeavour, you need people with the right skills to make that achievable, especially with enterprise software development, and it’s important to employ subject matter experts who can identify where your pain points really are.

Next time, I will respond to Gutteridge’s follow-up article, “Doing Without Databases in the 21st Century”.

Leave your thoughts in the comments below.

Photo by NeONBRAND on Unsplash.

14 thoughts on “Relational databases aren’t the problem

  • On Twitter, Cade Roux writes:

    I would just point out that relations ARE the tables. A table is defined to contain all the columns that RELATE to a candidate key. The “relationships” between tables is not where the “relational” part of RDBMS comes from. It comes from that very first part of normalization.

    • I’d have to disagree, or at least modify, this statement. Relations (relationships) are NOT tables. Tables do, however, contain data ‘columns’ related to an entity, represented by a ‘row’. However, the real ‘relationship’ is how the entities are related to other entities. Anyone who spent earlier years working with hierarchical databases would probably agree with me that ‘relational databases aren’t the problem’. The real problem is poor development of relational databases, of which I’ve seen a few – a few too many. Relational databases are the ‘cat’s meow’ if designed and accessed correctly, thus leading to far better code development than we saw in earlier days.

  • Thanks for writing this, Randolph. Given everything you point out in the rebuttal section, it’s harder to take his motive at face value.

    Also, Gutteridge says, “Take an older code base with a lot of additions and changes by different people over the years, then try to customize it to a new situation, and throw in all this added complexity of a relational database, and you are putting the project seriously at risk.” Strike the relational database part and the sentence is still true: Customizing years-old, evolving legacy code to a new situation is going to put the project at risk. Adding any other component — a NoSQL database, a caching layer, a new authentication method — they’ll all add complexity because software is complex. Like you said, storing data is just one aspect. It can’t be blamed for all the problems that may arise from the system as whole.

  • I pretty much agree with you here, though you’ve left out that NoSQL databases are vulnerable to injection as well. Anytime the database engine works with input instead of strict save/read, there are possibilities here.

    The other thing I think is missing is that there are problems with relational databases and many applications, but there are knowledge issues. While that’s not strictly an RDBMS issue, it is a reality that far too few people understand how to model, implement constraints, etc. In addition, few too few are implemented, because of the limitations and binding these create on applications that are often poorly specified from the beginning. Again, not the fault of the RDBMS, but the nature of putting things together today.

    Of course, many of these same issues arise in the application layer, where you must account for them. It’s easy to alter a JSON document (or similar structure) to evolve it, adding or removing columns. However, the application has the burden of maintaining complexity. This shifts the debt to the app code instead of being the database schema. It’s still there and if not handled, there can be issues with the application.

    Lastly one of the strengths of the RDBMS comes in searching multiple rows and reporting on them. While I think the performance of singletons (both read and write) can be dramatically better in non-RDBMSs, when it comes to aggregation and ensuring things like consistency or uniqueness over time, the RDBMS is much better. Most of the bolted on functionality of a NoSQL store comes in the form of really building RDBMS like tables.

  • I would find it difficult to argue with anything you have said here. The fact that a fully fledged algebra and calculus backs up the relational model puts it head and shoulders above most, if not all other tools.
    The skill is in modelling these relationships correctly, as real world relations tend to be messier than the relational world.
    Good work.

  • Thanks for putting him in his place. Any DBMS is only as good as it is being used. Failure to recognize its strengths and use them can be easy in today’s environment.

  • At some point the code-centric developers (I’m talking to you, COBava guy) will have to admit that all this post-relational/NoSql/whatever spiel is just 1950s FLOW-MATIC deja vu; lots o code manipulating dumb files. In the early days of the Web, doing editing (modulo injection issues) twice (browser and database) was considered wasteful, which led the Perl/PHP/java folks to view the RDBMS as just an easily backed up, unified file system. As one of my COBava colleagues said (in the 2000s, no less), “We prefer to do transactions in the client”. Bah. With today’s innterTubes speeds and persistent connection, we’re back to the paradigm of VT-200 to RS-232 to Host. Dumb terminal to smart application. Besides the integrity aspects, the real value of relational/SQL databases is easier maintenance. Consider what happens if one starts with an orthogonal schema, however minimal or not. As changes happen, and orthogonality is maintained, what can we deduce about schema changes? Isn’t it true that such changes will only be visible to application code interested in such? Will there be redundancy or side effects elsewhere in the application?? I find not. win-win all the way around.

    Client code-centric folks see all those redundancy slips heading their way. “We must preserve are legacy!”

  • Great rebuttal. My concerns around NoSQL data structures is just that. They have no structure. A JSON document can contain any type of data it wants. While that may be great from a development standpoint, it’s a hindrance for security, auditing, and compliance purposes. If you haven’t heard, the EU just implemented GDPR (General Data Protection Regulation). GDPR comes with stiff fines for non-compliance. If a user opt’s out and does not want you to store their personally identifiable information (PII), you must not store it or you will face some very stiff penalties. How will you find that in a NoSQL database full of JSON documents? How will you be able to comply? Can you guarantee 100% that person’s data is no longer in your NoSQL database? Good luck with that! With an relational database, it will be much easier to find that data because the tables and columns are designed to hold that PII data and we know what code that populates those tables.

  • Very good – a couple of comments:

    1. Stored procedures are unrelated to the number of trips to the database

    Whether getting the example Timesheet and Employee data takes one or two trips to the database is a logic question, regardless of whether the SQL is in stored procedures – either make one request up front that includes a join, or make two requests with no joins.

    2. Using prepared statements eliminates SQL injection risks

    Not mentioned in either paper is the way I always write SQL (in eg Java), which is to use prepared statements, where the SQL contains just a questionmark for each input value, and values are passed as an array. Admittedly at the price of having to provide the values in a correctly-ordered array, but in any SQL with local variables that can be made clear by assigning the values to variables at the start (which also eliminates the need to duplicate values that are used multiple times in the SQL).

    3. Relationships between tables have always been integral to relational databases

    While Codd’s original paper does limit “relation” to “a set of n-tuples” that we now refer to as a single table, it also talks about “foreign keys” being used for “A common requirement … to cross-reference … elements of … relation[s]”, so Gutteridge’s implying that relational databases do not handle this “business-centric concept” is a straw man fallacy.

    • Isn’t it meaningful that the trips to the database are also a function of ‘distance’? Stored proc access is, more or less, ‘in’ the database (whether in the same address space, or just the same machine…), while client access is some greater ‘distance’. In the context of web sites that ‘do transactions in the client’, aka the browser, there’s not just a logical difference but performance as well if there are multiple trips to the database.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: