Content warning: This is the second (and final) rebuttal essay about why someone is wrong on the Internet. It is no doubt biased. It might go into technical detail. Parts of it may be wrong. It may contain flippant remarks and editorialising. There are links to external references that may distract the reader.
Last time, I reviewed and responded to an opinion piece by Lance Gutteridge, who holds a Ph.D. in Computability Theory. I have not done much research in recursion theory and computability theory, so I can’t pretend to know as much as Gutteridge in this field of study; however, our knowledge overlaps in terms of set theory, reducibility theory, and hierarchies.
As with my previous essay, I will respond to the key points of the opinion piece in question. Where technical errors have been made, I will present corrections. Where opinions have been expressed, I will present my own counter-arguments.
There is a common term you may have seen online or in print, usually regarding technology: Fear, Uncertainty, and Doubt, or FUD. It is my express belief that Gutteridge is peddling FUD in order to increase sales of his book, and possibly increase revenue for his business. That is the position (or bias) of this essay.
“Doing Without Databases in the 21st Century”
This essay responds to a second opinion piece by Lance Gutteridge, dated 19 June 2018, titled Doing Without Databases in the 21st Century.
The premise of his piece is to present a purely software-driven implementation (in a Java-like language) of data persistence without needing a relational database. Gutteridge shows his software architecture colours here by setting out the basic principles of designing such a system. I have taken the principles and written them here as a list (with minor paraphrasing), which I will step through later:
- keep everything in memory, because it’s fast;
- persist the data to a hard drive for safety;
- if something goes wrong, rebuild the data state in memory;
- follow ACID compliance.
If anyone has read a book about relational databases recently (for example, SQL Server 2017 Administration Inside Out via my Amazon.com affiliate link, published by Microsoft Press in 2018 and featuring yours truly as a co-author), this list looks an awful lot like how a modern relational database management system (RDBMS) operates.
Let’s consider some of the features that SQL Server offers:
- a buffer pool: to keep data pages in memory, because it’s fast;
- a data file: to persist the data to a hard drive for safety;
- a transaction log: to rebuild the data state in memory after a failure;
- follows ACID compliance.
As a Microsoft Data Platform MVP and co-author of a book about SQL Server, I am biased. Of course I am, otherwise this essay would not exist. Nevertheless, it appears on the face of it that Lance Gutteridge is proposing to reimplement a relational database management system in order to avoid using a relational database management system.
This flies in the face of the DRY Principle (“don’t repeat yourself”) that Gutteridge took to heart in his first piece, and ignores over four decades of research by other computer scientists with doctorates, resulting in what appears to be an imitation of the very thing he’s trying to avoid. One might suggest, if one were fond of metaphors, that the author cannot see the forest for the trees.
Keep everything in memory
Gutteridge’s initial observation is that relational databases need to be replaced because when they were invented, computer memory was expensive. As a result, he posits, relational databases are immensely complicated due to disk access patterns from the 1980s. The answer he offers therefore is to ensure that all the data is kept in memory.
As a typical enterprise software developer, Gutteridge assumes that his application is the only one running on that computer:
When you have 256GB memory on a computer it is hard to find applications whose data cannot fit in this. Certainly those kinds of applications exists, but they are outliers.
This is easily disproved by looking at any production server. Even so, if we allow each enterprise application its own server with 256 GB of RAM, we can agree that it is a good idea to keep your data in memory. When data is closer to the CPU, latency is reduced. However, you certainly don’t want all of your data in memory if you’re not going to use it. That’s a terrible waste of resources, especially if you load the entire dataset into memory each time your application starts up. You want to be able to pick and choose which data you need; then, as it ages out, you asynchronously write any changes to a persisted storage layer and reuse the memory for more relevant data.
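This pick-and-choose, age-out pattern is essentially a cache with an eviction policy. As a minimal sketch (my own illustration, not code from either essay), Java’s LinkedHashMap can be turned into a tiny LRU cache that discards the least recently used entries once a capacity is reached:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class LruPageCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruPageCache(int capacity) {
        super(16, 0.75f, true); // access-order: get() refreshes recency
        this.capacity = capacity;
    }

    // Called after every insertion; returning true evicts the least
    // recently used entry. A real cache would write a dirty entry
    // back to the storage layer here before discarding it.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }
}
```

A real buffer pool is far more sophisticated, but the principle is the same: keep hot data in memory, evict cold data, and never assume the whole dataset must be resident.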
In fact, Gutteridge confirms this in his next statement, possibly his most accurate in the entire piece:
When you are dealing with disk access times, especially slow ones like in the 1980s, you do everything you can to figure out exactly what disk sectors you want to read because you don’t want to read any unnecessary ones and whenever possible you try and read contiguous groups of sectors, preferably a whole track, because it is faster.
Although Gutteridge is glib in his description of disk access times, and ignores the fact that databases keep data pages in memory as a rule, his observation is based in fact: you want to keep only the data you need in memory, so that it is readily accessible. This holds for all data access, not just relational data. If you open a Word document, it needs to be read off the drive and copied into memory. On modern versions of Word, the file format is compressed, so it must then be decompressed. You certainly don’t want all of your uncompressed Word documents taking up memory if you just want to edit one page of one document.
This is not explicitly a problem with relational databases. As we will see later, making appropriate use of memory will affect Gutteridge’s own solution. You cannot change the laws of physics: memory is faster than storage because of how it’s made and its proximity to the CPU, but you have to be careful with what you put into it.
[A] database table is spread over disk sectors in a complex pattern dictated by the file system software. One approach database companies took was too take over a large hunk of the disk and run their own sector management system inside it. That meant they were essentially writing a file system that was trying to access data with the minimum number of disk accesses. That is an immensely complicated problem and the solutions were immensely complicated programs. Of course, immensely complicated invariably means slow and buggy.
Where to begin? Firstly, complicated algorithms do not immediately mean code is slow and buggy. Hedging on the word “invariably” is not going to let Gutteridge off the hook. Code that has been improved over decades (as is the case of SQL Server and Oracle) is extremely complex, but light on bugs. If this complexity is as slow and buggy as he claims, perhaps Gutteridge can explain why modern file systems take their cues from database file structures.
Let’s talk a little about indexes, with this statement from Gutteridge:
Another consequence of doing the data access using the disk records is that to get any reasonable performance you need a large amount of indices. This causes the size of relational database on disk to swell to enormous sizes. It is not uncommon to see relational databases whose disk size is 20 times the table sizes. This means that the disk image of a relational database is often 95% overhead.
I would love to know which databases Gutteridge is referring to that carry 95% overhead, so that I can ask those DBAs why. Yes, indexes make database queries faster. Do you know why? It’s because an index is smaller than the table it serves, so more of it fits in memory and the queries run faster.
No, you don’t need a lot of indexes to get reasonable performance. You need to normalise your database properly when you initially design it, implement proper clustering keys, and make sure you apply proper indexes. It is an art more than a science: indexes must be highly selective, they must be covering (in other words, they can satisfy the requirements of the query without needing to go to the underlying table), and they shouldn’t be placed on every single column.
Entire books have been written on indexing, and anyone working in a professional capacity should probably read at least one of them if they don’t know how indexes actually work, rather than guessing.
Storage technology has not stagnated in the last four decades. The world is steadily moving from spinning drives towards solid-state storage in its many form factors. Solid-state storage’s greatest appeal is that it eliminates rotational seek time: the latency caused by a drive head seeking a block of data on a spinning magnetic surface. There are problems, such as write amplification, which reduces the lifespan of solid-state storage, but with a number of mitigations (including RAID) we can adapt accordingly. Overall though, storage is performing really well these days (pun intended).
RAID (which stands for redundant array of independent disks) is implemented in solutions like SANs (storage area networks), NAS devices (network-attached storage), DAS (direct-attached storage, literally the drives inside the server) and so on. I said this in my book: “When done correctly, combining disks into an array can increase overall performance [and] lower the risk of data loss, should one or more of the disks in the array fail”.
Performance is a feature. RAID arrays, flash storage, persisted memory (not quite there yet, but close) all help modern RDBMS solutions like SQL Server perform at sub-millisecond response times on billions of rows. I was recently at a customer where their SAN latency was measured in microseconds.
Performance matters in memory as well. What a lot of software developers seem to forget is that because memory is so much faster than storage, that speed hides poor coding practices until the application has to scale (i.e. the amount of data it is processing increases). This is especially evident with developers who write enterprise applications that assume there is no other software running on the same computer.
With that in mind, we look at Gutteridge’s proposed implementation of a persisted data store. Just as Brent Ozar Unlimited uses the freely available StackOverflow data for examples involving SQL Server, Gutteridge has implemented his own StackOverflow-like system. His examples use a Java-like syntax. Fortunately, C# developers like myself will find this easy enough to read. It doesn’t hurt that I briefly taught Java at a computer college in South Africa. To paraphrase a great line from the movie Jurassic Park: “This is Java. I know this.”
I will try to simplify this for my readers who aren’t programmers. In his code, Gutteridge takes a generic forum object (a question, answer, or comment) directly from the HTTP(S) stream as it comes in. Once it has been received, the object is serialized into a byte stream in order to write it to a persisted data store. Reading it back from the persisted store, the byte stream is deserialized into a ForumObject again. The object has a header to let the persisted storage know the object type and length.
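The type-and-length header idea can be sketched in a few lines (my own illustration of the general technique; the type codes and method names are mine, not Gutteridge’s): each record starts with a small header telling the reader what the object is and exactly how many bytes to consume.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class RecordCodec {
    // Write a record: a small header (object type, payload length)
    // followed by the serialized object bytes.
    static byte[] write(byte type, byte[] payload) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeByte(type);          // header: object type
            out.writeInt(payload.length); // header: payload length
            out.write(payload);           // body: serialized object
            out.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read it back: the header tells us the type and exactly how
    // many bytes belong to this object.
    static byte[] read(byte[] record, byte expectedType) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(record));
            if (in.readByte() != expectedType) {
                throw new IOException("unexpected object type");
            }
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            return payload;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```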
When implemented as a ForumPost, there is a unique identifier (as a long, or 8-byte-wide, field). I like unique values to describe objects. That’s a good thing. There’s also nothing wrong with planning for the future and assuming there will be more than a few billion objects in this data store. Of course, if there are four billion posts, the unique identifiers alone will take up 32 GB of space, but that’s neither here nor there if the server is dedicated and has a lot of memory available.
Each ForumPost has a couple of ArrayLists (the post replies, which are 0..n ForumPosts themselves, and a bunch of keywords). In Java, an ArrayList is a resizable object, so when it needs to grow, it does so without requiring manual resizing. Adding n objects takes time proportional to n: adding 1000 objects takes roughly 1000 times longer than adding 1 object. This is called linear time, or O(n) in Big O notation. Additionally, the object is not synchronised, which means it requires external locking if more than one thread is going to modify the array in any way.
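For non-programmers, that locking requirement looks like this in practice (a minimal sketch of my own, not Gutteridge’s code): the plain ArrayList must be wrapped or guarded before multiple threads can safely touch it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ReplyList {
    // ArrayList is not thread-safe; wrapping it gives every method
    // call its own lock, which is the simplest form of the external
    // locking described above.
    static List<String> newSynchronizedList() {
        return Collections.synchronizedList(new ArrayList<>());
    }
}
```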
Then the next thing Gutteridge has is the data access service itself, implemented in principle as a HashMap. In Java, a HashMap is almost the same as a Hashtable, except that it permits null keys and values. Great, null values have their place if you aren’t sure what will be going into the object. A HashMap is not synchronised, so it also requires locking to make any modifications.
A minor quibble I have is that there is a performance concern with a HashMap, per the official documentation:
When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
What this means is that the HashMap must be kept below a certain ratio of entries to buckets — its load factor, a kind of “fill factor” — or an expensive rehash occurs. Even though this is all happening in memory, a new bucket array roughly twice the size must be allocated and every entry rehashed into it. Will there be enough RAM to do this? Well, it depends on the current working set of this application. Perhaps the operation might cause the application to make use of the operating system’s page file (what we relational database folks call “disk spill”). We don’t like spilling to disk because, as we already know, memory is much faster than disk, so this could cause a major performance headache.
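One mitigation, if the number of entries is roughly known up front, is to presize the HashMap so it never needs to rehash. A small sketch of my own, assuming the documented default load factor of 0.75:

```java
import java.util.HashMap;
import java.util.Map;

class Presized {
    // With the default load factor of 0.75, a HashMap rehashes once
    // its entry count exceeds capacity * 0.75; sizing it for the
    // expected entry count up front avoids that rebuild entirely.
    static <K, V> Map<K, V> mapFor(int expectedEntries) {
        int initialCapacity = (int) Math.ceil(expectedEntries / 0.75);
        return new HashMap<>(initialCapacity);
    }
}
```

Of course, presizing only helps if you can predict the size — which an append-forever forum data store, by definition, cannot.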
If we look at how SQL Server works, all data is stored in 8 KB data pages, each with a 96-byte header. Depending on the type of object, the data page will show that in the header. Then, looking at tables in particular, data is stored either in a heap or a clustered index. A heap is unsorted, and is usually used for staging data in ETL processes, where indexes don’t matter because the entire table is being read.
Both clustered and non-clustered indexes are stored as B+trees. A non-clustered index is just a copy of the data, while the clustered index is the actual data itself. The B+tree (or “balanced tree”) will reorder data pages as the index grows, to reduce the number of seeks and scans it takes to find the right data in that index.
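The benefit of an ordered tree structure can be sketched with Java’s TreeMap (a red-black tree rather than a B+tree, but the seek-versus-scan principle is the same; the example data is mine): a range lookup seeks straight to the first matching key instead of examining every entry.

```java
import java.util.TreeMap;

class OrderedIndex {
    // A sorted index over post identifiers: range lookups seek to
    // the first matching key rather than scanning every entry.
    static TreeMap<Integer, String> build() {
        TreeMap<Integer, String> index = new TreeMap<>();
        for (int id = 1; id <= 100; id++) {
            index.put(id, "post-" + id);
        }
        return index;
    }
}
```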
Recovery from failure
The prime goal of a persisted data store is to ensure that the data is persisted correctly. In ACID compliance — which is covered in the next section — this is the D, which stands for “durable.” Here’s how Gutteridge implements it:
So we can regard the data access service as a consumer of a series of operations. These operations update the data model (in this example basically a bunch of lists). As we shall see the advantage of this is that the state of the data model can always be rebuilt by replaying all of the operations in the order they were originally encountered.
A “series of operations”, “in the order they were originally encountered” is an append-only model. The advantage here is that you can exactly reproduce the state of your data, assuming no corruption or loss during the storage writes due to disk failure, memory corruption, power surges, and so on.
When you want to reconstitute your memory state after an application failure, the entire log must be re-read, and all those objects must be put back into memory from a serialized byte stream. If you have a lot of data, this can take some time. Storage is slower than memory, and even a decent solid-state drive averages around 200 MB/s. If you have 256 GB of RAM to refill for your dedicated enterprise application, reading it all back at that rate works out to more than 20 minutes before the application can restart.
But, I hear you cry, we can run this in parallel! Not exactly. Remember where I noted that ArrayList and HashMap objects are not synchronised? In programming terms, it means they are not safe to be accessed at the same time by more than one thread. This makes the recovery process single-threaded.
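The append-and-replay model described above can be sketched as follows (my own minimal illustration, not Gutteridge’s code). Note that replay applies each operation strictly in order, because each operation may depend on the state left behind by the previous one:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class OperationLog<T> {
    private final List<Consumer<T>> operations = new ArrayList<>();

    // Record the operation (an in-memory stand-in for the persisted
    // log), then apply it to the live state.
    void append(Consumer<T> op, T state) {
        operations.add(op);
        op.accept(state);
    }

    // Recovery: re-apply every operation in original order against a
    // fresh state. Inherently sequential — hence single-threaded.
    void replay(T freshState) {
        for (Consumer<T> op : operations) {
            op.accept(freshState);
        }
    }
}
```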
SQL Server has a similar transaction log which records all modifications as they are committed, however there is a checkpoint process that runs in the background, persisting data to the data file asynchronously. Even though SQL Server makes extensive use of server memory using the buffer pool, it does not need to load the entire database in there to be functional. This way, if a failure occurs, only active portions of the transaction log need to be read to replay the transactions as they occurred.
ACID compliance
Atomic. Consistent. Isolated. Durable. These are the principles guiding data modifications in any data store. We covered durability a little in the previous section, but let’s look at them all as a group.
- Atomicity means that the entire transaction should succeed, or none of it. In other words it either commits, or it should be able to roll back to a point where the transaction never happened.
- Consistency refers to the data types and other rules of the data structures (however they are implemented). If any of the rules are broken, the transaction should roll back. If the rules pass, then the modification must take place accordingly (the transaction is committed).
- Isolation means that when you perform an action on one or more objects, this should not affect anyone else accessing those objects. This is usually implemented by locking the objects until the transaction is committed or rolled back.
- Durability means that a committed transaction is persisted successfully, even if there was a failure elsewhere. This is usually done by writing some sort of log to persisted storage.
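To make atomicity and isolation concrete, here is a minimal copy-and-swap sketch (my own illustration, not code from either essay): the work happens on an isolated copy, and a failure leaves the committed state untouched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class TinyTransaction {
    // Run "work" against a private copy of the current state.
    // Commit: return the copy as the new state. Rollback: on
    // failure, return the original, which was never touched.
    static <T> List<T> run(List<T> current, Consumer<List<T>> work) {
        List<T> copy = new ArrayList<>(current); // isolated working copy
        try {
            work.accept(copy);
            return copy;     // commit
        } catch (RuntimeException e) {
            return current;  // rollback: original state untouched
        }
    }
}
```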
In his proposed implementation, Gutteridge displays what I’d call arrogance regarding these principles:
Because the lists are in memory the operations are very fast. They should execute in a few microseconds at most. So this means that we can follow the KISS principle and adopt a really simple synchronization strategy.
Here he admits that locking objects is necessary. His arrogance comes from the fact that relational databases already operate in memory, and also take and release locks (and latches, depending on the type of operation) in microseconds. It’s not that relational databases are worse, but rather that a greater number of concurrent users can cause blocking, no matter how “very fast” the locks are. Performance will indeed be affected in a relational database; however, it will also be affected in his non-relational implementation.
And he goes on to make things worse:
A scheme like the above is dead simple. The operation on the access service is done in a single-threaded manner.
Single-threaded locking. Whereas SQL Server has (for example) high concurrency on the IDENTITY mechanism for adding unique values to a table, which in turn allows multiple threads to access a table, Gutteridge’s “dead simple” code does not, by his own admission.
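His “dead simple” strategy boils down to one global lock around the shared structure. A minimal sketch of my own of what that looks like: correct, but every writer queues behind the same lock.

```java
import java.util.ArrayList;
import java.util.List;

class SingleLockStore {
    private final List<String> posts = new ArrayList<>();
    private final Object lock = new Object();

    void add(String post) {
        synchronized (lock) { // every writer queues behind this one lock
            posts.add(post);
        }
    }

    int size() {
        synchronized (lock) {
            return posts.size();
        }
    }
}
```

This is safe, and with one user it is indeed fast; the cost only shows up when many concurrent writers contend for that single lock.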
Building our data access model in this way means that all the posts are in memory.
This is where his implementation falls apart. I have looked after the data of many customers, running hundreds of applications, where databases are tens or hundreds of times larger than available system memory. Using properly-crafted indexes (that don’t need to be 20 times larger than the data itself), and just having those indexes in memory when they’re needed, it is possible to have tens of multi-terabyte databases using the same server with 256 GB of RAM. And, unsurprisingly, if SQL Server fails (which is rare), recovery does not take 20 minutes except in very rare cases.
What about corruption?
Gutteridge’s arrogance continues. Here’s the first of two separate quotes:
Despite all the nasty things I’ve done to it during debugging and copying it when the application is still writing, the data store has never corrupted.
How does he know for sure? We have studies showing that memory can be corrupted by cosmic radiation, and unless you have ECC RAM you’ll never know. He has no validation of the data as it is read from or written to disk.
In SQL Server, every data page has checksum verification on by default, to ensure that data written and read is valid. Then, additional features like DBCC CHECKDB and backup checksums are also provided to check for corruption as the result of storage-layer problems (bit-flipping in a RAID controller, for example). I don’t know about you, dear reader, but having confidence that the relational database system is actually watching out for my data is a good thing.
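The page-checksum idea itself is simple to sketch (my own illustration, using CRC32 as a stand-in for SQL Server’s internal checksum): compute a checksum on write, verify it on read, and a flipped bit is detected instead of silently trusted.

```java
import java.util.zip.CRC32;

class ChecksummedPage {
    // Compute a checksum to store alongside the page when writing.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    // Verify on read: a mismatch means the bytes changed somewhere
    // between memory and storage (a flipped bit, a controller fault).
    static boolean verify(byte[] data, long storedChecksum) {
        return checksum(data) == storedChecksum;
    }
}
```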
Here’s the second, much longer quote, and you’ll be able to see why it is wrong:
Using relational databases has a nasty downside in that relational databases corrupt. Basically while processing a relational database things can happen in the course of operation that cause them to enter an internally contradictory state. Because disk speeds are very low compared to memory, to do data access on a slow block storage device requires some pretty tricky algorithms. You have to store indices on the disk because you can’t spend all the time rebuilding indices. Because of the complicated nature of keeping the data on disk with complicated indices to enable faster access it is possible to get things out of step and have different parts of that disk structure not agreeing with other parts. Then the database is corrupt and when the database software tries to read it in to initialize it crashes.
Gutteridge appears deeply confused by data structures. SQL Server’s buffer pool keeps an exact copy of the data page it reads off the storage layer, and then modifies the data page in memory. Every few seconds these “dirty” pages are written to the data file using a background process called a “checkpoint,” so that they’re persisted. And remember, every data page has a checksum.
The transaction log separately keeps track of data that is being committed, so that if SQL Server fails before a checkpoint runs, the active portion of the log contains everything that was in flight when the failure occurred.
Gutteridge goes off on a tangent about indexes, which has nothing to do with disk performance. All data is stored in data pages of different types, whether they belong to tables, indexes, stored procedures, views, functions, or whatever. Data pages are modified in memory, then persisted to disk by the checkpoint operation, with a checksum on each page. There’s nothing else to it.
Corruption has very little to do with relational databases, and almost everything to do with memory and disk faults. While there were some bugs in SQL Server that have caused corruption in the past, these are patched as soon as possible when found.
Gutteridge’s final mistake (in this opinion piece) is to forget about the integrity of data. How does his solution take that into account? I touched on normalisation in my previous essay. There are no constraints in his solution, and no way to verify dependencies. There’s no protection against duplication except the unique identifier, which is not enforced. How does he protect against repeating groups of data?
Simple: he does not.
Lance Gutteridge’s proposed solution to replace relational databases is a pale imitation of an actual relational database management system, implementing only half of the things that make an RDBMS useful. Why bother?
I was going to respond to the rest of his opinion pieces in future posts, but again, why bother?
Leave your thoughts in the comments below.