During routine maintenance on a customer’s production server, I discovered that they have one table consuming 40% of the storage in their database. That table contains just under 10 million rows, which isn’t that remarkable; another table in the same database has almost 500 million rows.
The remarkable thing — because you read the subject of this post and have figured it out — is that they’re storing JSON data in that table. Not only that, but there are two JSON columns, each holding between 2KB and 5KB per row, across more than 10 million rows. In other words, almost 100GB of storage is being used by two JSON columns. If you’ve done some mental arithmetic, you’ll also realize that with an average of 10KB per row (and SQL Server having a data page size of 8KB) there’s a lot of off-row nonsense going on here.
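If you want to see this for yourself, a quick way to check how much of a table’s storage lives off-row is to query sys.dm_db_partition_stats. Here’s a sketch; the table name is hypothetical:

```sql
-- In-row vs. off-row (LOB and row-overflow) pages for a hypothetical table.
-- Multiply the page counts by 8 to get KB.
SELECT
    SUM(ps.in_row_data_page_count)       AS InRowDataPages,
    SUM(ps.row_overflow_used_page_count) AS RowOverflowPages,
    SUM(ps.lob_used_page_count)          AS LobPages
FROM sys.dm_db_partition_stats AS ps
WHERE ps.object_id = OBJECT_ID(N'dbo.BigJsonTable');
```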
I’ll also pre-emptively note that if this table were simply an append-only archive table, the row size would not really matter. Unfortunately, this table participates in thousands of transactions per day, and because the original developers used Entity Framework and didn’t think much of using NVARCHAR(MAX), the entire row comes over the wire into the application each time it is queried.
As I’ve written previously about this kind of thing, this is not a good design pattern. Using the VARBINARY(MAX) data type with COMPRESS in the INSERT/UPDATE queries — and DECOMPRESS in the SELECT queries — is a much better design pattern and dramatically reduces the amount of data transferred over the network. Additionally, SQL Server needs significantly less space to store, maintain, and back up this compressed data.
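Here is a minimal sketch of that pattern in T-SQL. The table and column names are made up for illustration; the real table has two JSON columns, but one is enough to show the idea:

```sql
-- Hypothetical table that stores compressed JSON instead of NVARCHAR(MAX).
CREATE TABLE dbo.AuditLog
(
    AuditLogID        BIGINT IDENTITY(1, 1) PRIMARY KEY,
    PayloadCompressed VARBINARY(MAX) NOT NULL
);

-- INSERT: compress the JSON on the way in (COMPRESS uses GZIP).
INSERT INTO dbo.AuditLog (PayloadCompressed)
VALUES (COMPRESS(N'{"orderId": 42, "status": "shipped"}'));

-- SELECT: decompress on the way out and cast back to the original string type.
SELECT CAST(DECOMPRESS(PayloadCompressed) AS NVARCHAR(MAX)) AS Payload
FROM dbo.AuditLog
WHERE AuditLogID = 1;
```

Keep in mind that COMPRESS and DECOMPRESS require SQL Server 2016 or later, and that DECOMPRESS returns VARBINARY(MAX), so you have to cast the result back to the string type you started with.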
In my experience with plain text, I can save between 80% and 99% of storage, depending on the data and the compression used. Using a sample of 1,000 rows from this table, I estimate that we can save as much as 80GB of space — probably even more — by switching from NVARCHAR(MAX) to VARBINARY(MAX) with COMPRESS and DECOMPRESS.
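If you want to run the same estimate against your own data, a rough sketch (table and column names hypothetical) is to compare the raw size of a sample against its compressed size:

```sql
-- Compare raw vs. compressed size over a 1,000-row sample (names hypothetical).
SELECT
    SUM(DATALENGTH(JsonPayload))           AS RawBytes,
    SUM(DATALENGTH(COMPRESS(JsonPayload))) AS CompressedBytes
FROM (SELECT TOP (1000) JsonPayload FROM dbo.BigJsonTable) AS sample;
```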
I know I’ve spoken about this several times before, but I keep seeing this over and over again. If you must store unstructured JSON or XML in your database, please compress it. The CPU overhead is marginal compared to the amount of CPU and other resources required to maintain and manage that additional storage.
Imagine: this customer currently has a little under 100GB free on their storage subsystem. With these changes in place, three months from now that free space could be as much as double what it is today, and backups will consume less space. The entire database will fit comfortably in the buffer pool, and you know what that means for performance.
Share your thoughts in the comments below.
Sounds like someone needs an Introduction to the SQL API in Cosmos DB
Oh man, CosmosDB is a better option for JSON, sure, but some folks are married to keeping everything in one place, so compress it!
You wrote “there’s a lot of off-row nonsense going on here.”
While I do agree that doing things like storing JSON, XML, and whatever just bugs the heck out of me for the same reasons as you, I’m also bugged by things like virtually anything larger than a VARCHAR(200). Error messages, comments, whatever. They all have the same problem… They live “In-Row” and they’re probably never going to be indexed.
As you mention, one way to fix those issues is to COMPRESS the columns. Ah, but that isn’t available for anything prior to 2016 and, like it or not, clients are going to have some pre-2016 databases that they simply cannot change. They might even have some 2016+ databases that also cannot be changed simply because it would cost way too much to make the changes and then regression test things.
If you do have columns like a VARCHAR(2000) or more, you’re going to end up with a problem that I call “Trapped Short Rows”. These are short rows of data that actually leave a lot of free space in a page (especially for more than VARCHAR(4000) or NVARCHAR(2000)) that will never, ever be used again and cannot be recovered by the use of {gasp!} REORGANIZE or REBUILD in an ALTER INDEX. The presence of “Trapped Short Rows” is a huge waste of disk space and memory.
It also creates another performance/resource related issue. If most of your lookups in the table are single-row lookups and a scan occurs, you have to read all those large variable-width columns instead of the usually narrow search criteria. It has to load the whole bloody table into memory during the scan, possibly (usually) knocking much more valuable things out of memory. If you have to read a range of rows using a seek followed by a range scan and you’re not actually going to use the bloody comments column (or whatever), then you could be reading just 1 or 2 rows per page instead of perhaps hundreds, and we all know how that affects performance and memory usage. It doesn’t make for a comfy “Martha Stewart Moment”.
The fix for THAT nonsense is to put the “nonsense” you spoke of to very good use. Even though they do fit “In-Row”, change them to one of the MAX datatypes, give those columns a single space or empty string default (the out-of-row pointer is “ExpAnsive”, so you need every row to have a default so the pointer pre-materializes), set the table option to store the large datatypes out of row, do an in-place update of all the existing data in those columns to get the existing data to move, and then rebuild the Clustered Index to recover all the extra space that has now appeared in your Clustered Index.
Then watch as the queries that don’t actually return any of those longer columns are suddenly MUCH faster, and the need for one, two, or a few of your non-clustered “covering” indexes has simply vanished, which is an additional space savings.
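For anyone who wants to try it, here is a minimal sketch of those steps, assuming a hypothetical dbo.Orders table with a wide Comments column. Test it on a copy first:

```sql
-- 1. Widen the column to a MAX data type so it can be pushed off-row.
--    (Keep your existing NULL / NOT NULL setting.)
ALTER TABLE dbo.Orders ALTER COLUMN Comments VARCHAR(MAX) NULL;

-- 2. Give the column a default so the off-row pointer materializes for new rows.
ALTER TABLE dbo.Orders ADD CONSTRAINT DF_Orders_Comments DEFAULT ('') FOR Comments;

-- 3. Tell SQL Server to store the MAX columns off-row.
EXEC sys.sp_tableoption 'dbo.Orders', 'large value types out of row', 1;

-- 4. In-place update so the existing values actually move off-row.
UPDATE dbo.Orders SET Comments = Comments;

-- 5. Rebuild to recover the in-row space that just opened up
--    (or rebuild just the clustered index by name).
ALTER INDEX ALL ON dbo.Orders REBUILD;
```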
Because it’s unstructured data that’s being held, you can argue that you need Cosmos or some other database engine rather than being “married” to SQL Server, but some people just don’t believe in, don’t want to, or don’t even need to be database polygamists. They just need to be shown another reason to love the thing they’re married to. 😉
A simple “trick” can frequently work just as well as (and sometimes a whole lot better than) external solutions, and be a whole lot less expensive than bringing yet another “engine” and “language” into the house to learn well enough to use and support.
To wit, bringing something like Cosmos into the house to support this simple problem would be the real “nonsense” to me. 😉
p.s. Heh… speaking of “blobs”, I absolutely hate the response software on this site. It removed all my attempts at paragraph separation. I’m amazed that this type of “nonsense” still occurs in this day and age.
The terminology you guys use is confusing. First of all, by the very fact that the data being stored is JSON or XML, it is _structured_, not _unstructured_. The point of using the format is that it’s structured. If you mean _uncompressed_ or _text_, then those are the words that should be used. Secondly, the author said “the original developers used Entity Framework and didn’t think much of using NVARCHAR(MAX)”, then later on uses the example of an NVARCHAR(MAX) column. Obviously they did think of using NVARCHAR(MAX). Perhaps I need to have my memory refreshed, but the declaration of NVARCHAR means the text stored is of _variable_ length and the database uses only the amount of storage it needs for the text. So all of this “nonsense” is contradictory. Would somebody care to untangle this mess to improve my understanding of what it is you all were trying to say here?
Hi Ed, thanks for your comment.
XML and JSON are unstructured in the context of a relational database like SQL Server. When I say structured data, I am referring explicitly to the tables and columns of the relational database, as well as implicitly referring to referential integrity between those tables with foreign keys. Data stored in an XML column is unstructured because there is no referential integrity. Each “document” in that column is allowed to have different attributes from the others. Applying an XML schema on top of an XML column adds another layer of abstraction. If all you’re doing is storing unstructured data with a schema, there are possibly better choices than SQL Server.
On the second question, there is a tendency for software developers who don’t understand SQL Server to use NVARCHAR(MAX) for underlying data types because it’s “variable length” and matches the Unicode strings in application code like C#, but it’s UTF-16, so it needs at least two bytes per character, even if the data is in ASCII format. Additionally, the storage engine makes different decisions around indexing (you can’t use a MAX column in an index key, only as an included column), storage (data will be pushed off-row, making it slower to read and modify), and memory (the optimizer will make overly large memory grants because of the data type), to name just a handful of problems.
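A quick sketch shows the two-bytes-per-character cost directly:

```sql
-- The same four ASCII characters: 4 bytes as VARCHAR, 8 bytes as NVARCHAR (UTF-16).
SELECT DATALENGTH('JSON')  AS VarcharBytes,
       DATALENGTH(N'JSON') AS NvarcharBytes;
```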
Hopefully this clears up some of the questions you’ve raised. I don’t have a problem with VARCHAR and NVARCHAR generally, and as long as a table design places reasonable limits on these columns, it makes for better performance and database maintenance.