How SQL Server stores data types: XML

A half-closed MacBook

This week we’re looking at how the database engine stores the XML data type in SQL Server and Azure SQL Database.

If you would like to read about storage of other data types, here are the previous posts in the series:

What is XML?

The Extensible Markup Language was developed to make a human-readable and machine-readable text format for encoding documents to transmit between computers, especially over the Internet. While some people may prefer less verbose formats these days (including JSON and YAML), XML is here to stay, and Microsoft built in support for XML as a specific data type for the SQL Server storage engine way back in 2005.

Thanks to a number of formal specifications introduced since 1998, XML documents must contain certain elements within a formalized structure to be recognised as XML. If you’ve ever worked with the HTML format, XML will look similar as they are both derived from the same SGML (Standard Generalized Markup Language) standard, and contain the same components (tags, elements, and attributes).

The key to the XML format is the word “extensible.” Every XML document has an explicit or implicit schema attached to it, which is also in XML format (it’s XML all the way down). This schema provides a set of rules for an XML document describing the types and constraints permitted in each tag, element, and attribute. An XML schema is like a table definition in a relational database, but scoped to a particular column.

XML is useful for semi-structured data, or a data structure that changes all the time and would incur significant costs to modify an equivalent relational structure.

XML as a SQL Server data type

The SQL Server XML data type supports both schema-defined and schema-less XML documents (typed and untyped), and partial documents. Partial documents are still considered well-defined XML, but don’t have surrounding <?xml>...</xml> tags for example. Either way, the storage engine validates the data when inserting or updating into an XML column.

The XML data type lets us store up to 2GB per column per row, but it’s prudent here to note that just because we can, it doesn’t mean we should.

Let’s get some valid XML using the FOR XML Transact-SQL syntax built into SQL Server. For example, I can run this query on my Test database:

The resulting XML looks like this:

If we remove the line breaks and spaces this output is 232 bytes long in UTF-8 format, or 464 bytes in UTF-16 format (two bytes per character).

After we insert this data into an XML column, we can look at the underlying data on the data page:

The first thing you might notice is that you can read the data as clear text, but with a space after each character. XML is by default a Unicode format, and the data is stored as an inline blob in what appears to be UTF-16 (two bytes per character, like NVARCHAR).

The second thing you might notice is the Length, which is 355 bytes. SQL Server is taking some shortcuts by not storing the angle-brackets as characters (along with other things). If we associate a schema with the XML column (which Microsoft calls typed), there will be some additional shortcuts that the database engine takes in how it stores that data. In other words, in most cases the XML data type takes up less storage space than NVARCHAR.

Unfortunately that means that XML data is not immutable. If we want to store XML for auditing purposes where the data cannot be modified, the XML data type is the wrong data type to use and we should instead use NVARCHAR(MAX), VARCHAR(MAX) with a UTF-8 collation, or VARBINARY(MAX) using the COMPRESS and DECOMPRESS functions.

Leave your thoughts in the comments below.

Photo by Philipp Katzenberger on Unsplash.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: