During routine maintenance on a customer’s production server, I discovered that they have one table consuming 40% of the storage in their database. That table contains just under 10 million rows, which isn’t that remarkable; another table in the same database has almost 500 million rows.
The remarkable thing (you read the subject of this post, so you've already figured it out) is that they're storing JSON data in that table. Not only that, but there are two JSON columns, each holding between 2KB and 5KB of data per row, across more than 10 million rows. In other words, almost 100GB of storage is being used by two JSON columns. If you've done some mental arithmetic, you'll also realize that with an average of 10KB per row (and SQL Server having a data page size of 8KB), there's a lot of off-row nonsense going on here.
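For what it's worth, arriving at that number is straightforward: sum the length of each JSON column across the table. A query along these lines will do it; the table and column names below are made up for illustration, so substitute your own.

-- Rough measurement of how much space two (hypothetical) JSON columns consume.
SELECT
    COUNT_BIG(*) AS TotalRows,
    SUM(DATALENGTH(RequestPayload))  / 1073741824.0 AS RequestPayloadGB,  -- bytes to GB
    SUM(DATALENGTH(ResponsePayload)) / 1073741824.0 AS ResponsePayloadGB
FROM dbo.AuditLog;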
I'll also pre-emptively note that if this table were simply an append-only archive table, the row size would not really matter. Unfortunately, this table participates in thousands of transactions per day, and because the original developers used Entity Framework and didn't think twice about using NVARCHAR(MAX), the entire row comes over the wire into the application every time it is queried.
As I've written previously about this kind of thing, this is not a good design pattern. Using the VARBINARY(MAX) data type with COMPRESS in the UPDATE queries, and DECOMPRESS in the SELECT queries, is a much better design pattern and dramatically reduces the amount of data transferred over the network. Additionally, SQL Server needs significantly less space to store, maintain, and back up this compressed data.
In my experience with plain text, I can save between 80% and 99% of storage, depending on the data and the compression used. Using a sample of 1,000 rows from this table, I estimate that we can save as much as 80GB of space (probably even more) by switching from NVARCHAR(MAX) to VARBINARY(MAX) with COMPRESS.
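If you want to run the same estimate against your own data, compare the raw and compressed sizes of a sample directly. The names below are again hypothetical.

-- Estimate the compression ratio from a sample of 1,000 rows.
SELECT
    SUM(DATALENGTH(RequestPayload))           AS UncompressedBytes,
    SUM(DATALENGTH(COMPRESS(RequestPayload))) AS CompressedBytes
FROM (
    SELECT TOP (1000) RequestPayload
    FROM dbo.AuditLog
    ORDER BY AuditLogID DESC  -- most recent rows; any representative sample works
) AS Sample;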
I know I’ve spoken about this several times before, but I keep seeing this over and over again. If you must store unstructured JSON or XML in your database, please compress it. The CPU overhead is marginal compared to the amount of CPU and other resources required to maintain and manage that additional storage.
Imagine: this customer currently has a little under 100GB free on their storage subsystem. With these changes in place, three months from now that free space could be as much as double, and backups will consume less space too. The entire database will fit comfortably in the buffer pool, and you know what that means for performance.
Share your thoughts in the comments below.