How SQL Server stores data types: XML

This week we’re looking at how the database engine stores the XML data type in SQL Server and Azure SQL Database.

If you would like to read about storage of other data types, here are the previous posts in the series:

What is XML?

The Extensible Markup Language was developed to make a human-readable and machine-readable text format for encoding documents to transmit between computers, especially over the Internet. While some people may prefer less verbose formats these days (including JSON and YAML), XML is here to stay, and Microsoft built in support for XML as a specific data type for the SQL Server storage engine way back in 2005.

Thanks to a number of formal specifications introduced since 1998, XML documents must contain certain elements within a formalized structure to be recognised as XML. If you’ve ever worked with the HTML format, XML will look similar as they are both derived from the same SGML (Standard Generalized Markup Language) standard, and contain the same components (tags, elements, and attributes).

The key to the XML format is the word “extensible.” Every XML document has an explicit or implicit schema attached to it, which is also in XML format (it’s XML all the way down). This schema provides a set of rules for an XML document describing the types and constraints permitted in each tag, element, and attribute. An XML schema is like a table definition in a relational database, but scoped to a particular column.

XML is useful for semi-structured data, or a data structure that changes all the time and would incur significant costs to modify an equivalent relational structure.

XML as a SQL Server data type

The SQL Server XML data type supports both schema-defined and schema-less XML documents (typed and untyped), and partial documents. Partial documents are still considered well-defined XML, but don’t have surrounding <?xml>...</xml> tags for example. Either way, the storage engine validates the data when inserting or updating into an XML column.

The XML data type lets us store up to 2GB per column per row, but it’s prudent here to note that just because we can, it doesn’t mean we should.

Let’s get some valid XML using the FOR XML Transact-SQL syntax built into SQL Server. For example, I can run this query on my Test database:

SELECT
       name,
       object_id,
       schema_id,
       type,
       type_desc,
       create_date,
       modify_date
FROM sys.tables
FOR XML PATH;

The resulting XML looks like this:

<row>
  <name>Test</name>
  <object_id>565577053</object_id>
  <schema_id>1</schema_id>
  <type>U </type>
  <type_desc>USER_TABLE</type_desc>
  <create_date>2019-11-15T23:05:58.353</create_date>
  <modify_date>2019-11-17T14:54:30.243</modify_date>
</row>

If we remove the line breaks and spaces this output is 232 bytes long in UTF-8 format, or 464 bytes in UTF-16 format (two bytes per character).

After we insert this data into an XML column, we can look at the underlying data on the data page:

ColData = [BLOB Inline Data] Slot 0 Column 2 Offset 0xf Length 355 Length (physical) 355

000000BF135EE8B0:   dfff01b0 04f00372 006f0077 00ef0000 01f801f0  ßÿ.°.ð.r.o.w.ï...ø.ð
000000BF135EE8C4:   046e0061 006d0065 00ef0000 02f80211 04540065  .n.a.m.e.ï...ø...T.e
000000BF135EE8D8:   00730074 00f7f009 6f006200 6a006500 63007400  .s.t.÷ð   o.b.j.e.c.t.
000000BF135EE8EC:   5f006900 6400ef00 0003f803 11093500 36003500  _.i.d.ï...ø.. 5.6.5.
000000BF135EE900:   35003700 37003000 35003300 f7f00973 00630068  5.7.7.0.5.3.÷ð    s.c.h
000000BF135EE914:   0065006d 0061005f 00690064 00ef0000 04f80411  .e.m.a._.i.d.ï...ø..
000000BF135EE928:   013100f7 f0047400 79007000 6500ef00 0005f805  .1.÷ð.t.y.p.e.ï...ø.
000000BF135EE93C:   11025500 2000f7f0 09740079 00700065 005f0064  ..U. .÷ð  t.y.p.e._.d
000000BF135EE950:   00650073 006300ef 000006f8 06110a55 00530045  .e.s.c.ï...ø...U.S.E
000000BF135EE964:   0052005f 00540041 0042004c 004500f7 f00b6300  .R._.T.A.B.L.E.÷ð.c.
000000BF135EE978:   72006500 61007400 65005f00 64006100 74006500  r.e.a.t.e._.d.a.t.e.
000000BF135EE98C:   ef000007 f8071117 32003000 31003900 2d003100  ï...ø...2.0.1.9.-.1.
000000BF135EE9A0:   31002d00 31003500 54003200 33003a00 30003500  1.-.1.5.T.2.3.:.0.5.
000000BF135EE9B4:   3a003500 38002e00 33003500 3300f7f0 0b6d006f  :.5.8...3.5.3.÷ð.m.o
000000BF135EE9C8:   00640069 00660079 005f0064 00610074 006500ef  .d.i.f.y._.d.a.t.e.ï
000000BF135EE9DC:   000008f8 08111732 00300031 0039002d 00310031  ...ø...2.0.1.9.-.1.1
000000BF135EE9F0:   002d0031 00370054 00310034 003a0035 0034003a  .-.1.7.T.1.4.:.5.4.:
000000BF135EEA04:   00330030 002e0032 00340033 00f7f7             .3.0...2.4.3.÷÷

The first thing you might notice is that you can read the data as clear text, but with a space after each character. XML is by default a Unicode format, and the data is stored as an inline blob in what appears to be UTF-16 (two bytes per character, like NVARCHAR).

The second thing you might notice is the Length, which is 355 bytes. SQL Server is taking some shortcuts by not storing the angle-brackets as characters (along with other things). If we associate a schema with the XML column (which Microsoft calls typed), there will be some additional shortcuts that the database engine takes in how it stores that data. In other words, in most cases the XML data type takes up less storage space than NVARCHAR.

Unfortunately that means that XML data is not immutable. If we want to store XML for auditing purposes where the data cannot be modified, the XML data type is the wrong data type to use and we should instead use NVARCHAR(MAX), VARCHAR(MAX) with a UTF-8 collation, or VARBINARY(MAX) using the COMPRESS and DECOMPRESS functions.

Leave your thoughts in the comments below.

Photo by Philipp Katzenberger on Unsplash.