
T-SQL Tuesday Retrospective #006: What about blob?


I am revisiting old T-SQL Tuesday invitations from the very beginning of the project.

On May 3, 2010, Michael Coles invited us to write about how we use LOB data, so now you know what this week’s post is about.

Let’s go over some definitions and complain a little, because that’s how cranky data professionals like me roll. In SQL Server, the term “LOB” stands for “large object” and refers to data that is larger than 8,000 bytes and therefore cannot fit in-row on a single data page.

There is another term — “blob” — which I refuse as a matter of principle to capitalize, because I do not acknowledge the backronym that was applied to it. When Jim Starkey at DEC invented the word, he meant it in terms of the film “The Blob.” In the world of data, a blob is a chunk of unstructured data. Some marketing person called Terry decided to make it into the acronym “basic large object” which doesn’t even make sense to me. Later still, Informix decided to make “basic” into “binary” and I just can’t get over it. A blob is a blob. Gretchen, stop trying to make BLOB happen.

A blob might be a text file, plain XML or JSON, a Word document, an image, or a zip file. Informix wants you to think the “B” means “binary,” but all data is binary if you think about it.

When we look at the several data types for storing LOB data in SQL Server, the ones we might think of first are specifically designed for storing clear text. These are VARCHAR(MAX) and NVARCHAR(MAX), which you can read up about on Microsoft Docs. They’re variable-length data types that can store clear text up to 2 GB in size, with values larger than 8,000 bytes pushed off-row. The key distinction is that NVARCHAR is for Unicode strings, which use at least two bytes per character.

Then we get VARBINARY(MAX), which also stores up to 2 GB. This data type does not make any assumptions about the stored blob’s format. You can store text, zip files, images, public key security certificates, you name it. If it can be persisted to storage, it can be stored in VARBINARY(MAX).
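To make the distinction concrete, here’s a minimal sketch showing the three variable-length LOB types side by side (the table and column names are made up purely for illustration):

```sql
-- Hypothetical demo table: one column per LOB type discussed so far.
CREATE TABLE dbo.LobDemo
(
    Id          INT IDENTITY(1, 1) PRIMARY KEY,
    PlainText   VARCHAR(MAX)   NULL, -- single-byte character text, up to 2 GB
    UnicodeText NVARCHAR(MAX)  NULL, -- Unicode (UTF-16) text, up to 2 GB
    AnyBlob     VARBINARY(MAX) NULL  -- arbitrary bytes: zip files, images, certificates
);

INSERT INTO dbo.LobDemo (PlainText, UnicodeText, AnyBlob)
VALUES ('plain text', N'Üñïçödé text', 0x504B0304); -- any bytes will do
```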

XML — which I’ve also covered on this blog — is one of the unsung heroes of SQL Server data types because it supports schema definitions and XML queries, along with other exciting uses. However, it slightly modifies the original data when you insert it (typically by discarding insignificant whitespace), so take care.
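You can see the modification for yourself with a quick round trip through the XML data type — a small sketch; by default, SQL Server discards insignificant whitespace when it parses the document:

```sql
-- The original string, with indentation and line breaks between elements.
DECLARE @original NVARCHAR(MAX) = N'<root>
    <item attr="1" />
</root>';

DECLARE @parsed XML = @original;

-- The round trip does not return the original string: the insignificant
-- whitespace between elements is gone, and the empty-element tag is tidied up.
SELECT CONVERT(NVARCHAR(MAX), @parsed) AS RoundTripped;
```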

Older data types which we no longer recommend are TEXT, NTEXT, and IMAGE, which are roughly equivalent to (but not the same as) VARCHAR(MAX), NVARCHAR(MAX), and VARBINARY(MAX) respectively. They are deprecated. Don’t use them.

Of course, this is the part of the blog post where the data architect in me suggests that if you’re storing large reams of unstructured data in your database, you might want to consider a different approach using the file system instead.

So, with Michael asking how we use LOB data, I will discuss a production database design where I made extensive use of VARBINARY(MAX) in combination with the COMPRESS and DECOMPRESS system functions in Transact-SQL.

Every single call to a third-party API, as well as its response, needed to be recorded for auditing purposes. First, the requirement was that the XML could not be modified, so we had to rule out the XML data type immediately. Second, these calls are rather large (sometimes in the tens of kilobytes), but because they are mostly plain text they compress to between 2% and 10% of their original size.

When inserting the data using a stored procedure, a hash of the plain text is calculated, and then the text is compressed into a VARBINARY(MAX) value using the COMPRESS function. Both the compressed data and the hash are inserted into a table which is itself audited. When we need to read the data back, we again use a stored procedure to DECOMPRESS the VARBINARY(MAX) data into plain text and validate it against the stored hash. The resulting plain text can then be parsed using a custom XML tool.
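Here is a simplified sketch of that pattern. The table, procedure, and column names are invented for illustration — this is not the production schema — and COMPRESS and DECOMPRESS (which use the GZip algorithm) require SQL Server 2016 or later:

```sql
-- Hypothetical audit table: compressed payload plus a hash of the original text.
CREATE TABLE dbo.ApiAudit
(
    AuditId     BIGINT IDENTITY(1, 1) PRIMARY KEY,
    PayloadHash BINARY(32)     NOT NULL, -- SHA2_256 of the original plain text
    Payload     VARBINARY(MAX) NOT NULL, -- GZip-compressed plain text
    CapturedAt  DATETIME2(0)   NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

-- Write path: hash the plain text, then store it compressed.
CREATE PROCEDURE dbo.ApiAudit_Insert
    @Xml NVARCHAR(MAX)
AS
BEGIN
    INSERT INTO dbo.ApiAudit (PayloadHash, Payload)
    VALUES (HASHBYTES('SHA2_256', @Xml), COMPRESS(@Xml));
END;
GO

-- Read path: decompress, recompute the hash, and only return rows that validate.
CREATE PROCEDURE dbo.ApiAudit_Read
    @AuditId BIGINT
AS
BEGIN
    SELECT CAST(DECOMPRESS(Payload) AS NVARCHAR(MAX)) AS Xml
    FROM dbo.ApiAudit
    WHERE AuditId = @AuditId
      AND PayloadHash = HASHBYTES('SHA2_256',
            CAST(DECOMPRESS(Payload) AS NVARCHAR(MAX)));
END;
GO
```

Note that DECOMPRESS returns VARBINARY(MAX), so the result must be cast back to the same character type that was originally compressed — compressing NVARCHAR and casting the output to VARCHAR (or vice versa) will mangle the text.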

This allowed us to track every single API call and response, ensure the data was unmodified, and still save at least 90% of storage compared to the original size of the calls. Because reads from the table are limited to very specific scenarios, we don’t need any indexes on the XML data, and after a set period of time the data is archived in a read-only manner which maintains its integrity.

I hope you’re enjoying this series. Share your thoughts in the comments below.

Photo by Michael Dziedzic on Unsplash.