A short post this week.
On a mailing list recently, someone noticed that a .NET application writing to SQL Server did not have the expected behaviour with UTF-8 collation and data types.
To refresh our memories, UTF-8 is newly supported in SQL Server 2019, and provides potential savings of up to 50% when storing strings, as opposed to using traditional Unicode data types (which generally means two bytes per character).
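To make that saving concrete, here is a small C# sketch (not from the mailing-list thread; the sample strings are arbitrary) comparing the encoded size of the same text in UTF-8 and UTF-16:

```csharp
using System;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        // Mostly-ASCII text: one byte per character in UTF-8, two in UTF-16.
        string ascii = "The quick brown fox";
        Console.WriteLine(Encoding.UTF8.GetByteCount(ascii));    // 19 bytes
        Console.WriteLine(Encoding.Unicode.GetByteCount(ascii)); // 38 bytes (UTF-16)

        // Extended characters cost two or more bytes in UTF-8, so the saving shrinks.
        string accented = "Jalapeño crème brûlée";
        Console.WriteLine(Encoding.UTF8.GetByteCount(accented));    // 25 bytes
        Console.WriteLine(Encoding.Unicode.GetByteCount(accented)); // 42 bytes (UTF-16)
    }
}
```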
In the example on the mailing list, the person created a table in a database with a UTF-8 collation and then, using the SqlDbType values Char and VarChar, wrote three sample strings containing extended characters to the table. When the table was queried again, the three values came back as plain ASCII, so it was clear that Unicode support was being ignored.
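The scenario looked roughly like the following sketch (not the original code from the mailing list; the connection string, table name, and sample string are placeholders, and the database is assumed to have a UTF-8 default collation):

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

class Utf8Repro
{
    static void Main()
    {
        // Placeholder connection string; assumes a database whose default collation
        // is one of the SQL Server 2019 UTF-8 collations, with a hypothetical table
        // dbo.Samples (Value varchar(100)).
        using (var conn = new SqlConnection("Server=.;Database=Utf8Demo;Integrated Security=true"))
        {
            conn.Open();

            using (var insert = new SqlCommand("INSERT INTO dbo.Samples (Value) VALUES (@value);", conn))
            {
                // VarChar is the parameter type that triggered the behaviour described above.
                insert.Parameters.Add("@value", SqlDbType.VarChar, 100).Value = "Łódź, Ærøskøbing, 東京";
                insert.ExecuteNonQuery();
            }

            using (var select = new SqlCommand("SELECT Value FROM dbo.Samples;", conn))
            using (var reader = select.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Per the post above, the extended characters do not survive the round trip.
                    Console.WriteLine(reader.GetString(0));
                }
            }
        }
    }
}
```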
After confirming this scenario on my own machine (with a different .NET framework and a different operating system), I narrowed the issue down to the SqlDbType. When I converted these to NChar and NVarChar respectively, the values were written to the database correctly and read back intact.
The moral of the story is that — as with all strings in .NET applications — they should be treated as Unicode in all scenarios, which means using NChar and NVarChar data types when reading from and writing to SQL Server. By leaving UTF-8 handling to the SQL Server engine directly, it’s one less thing to worry about. After all, UTF-8 is ultimately a storage feature.
Share your string conversion stories in the comments below.
Photo by Jason Leung on Unsplash.
Back in May 2019 (the date this blog post is timestamped with), System.Data.SqlClient was the only option for .NET, and it indeed does not support UTF-8 collations natively, so one has to resort to NChar/NVarChar. Since then, Microsoft has released a .NET provider that supports UTF-8 collations natively, Microsoft.Data.SqlClient (https://devblogs.microsoft.com/dotnet/introducing-the-new-microsoftdatasqlclient/), as well as updated ODBC and OLE DB providers with UTF-8 support.
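A minimal sketch of the same insert with the newer provider, using the same placeholder names as the earlier example (per this comment, VarChar can then be used directly against a UTF-8-collated column):

```csharp
using System.Data;
using Microsoft.Data.SqlClient;   // the newer provider, instead of System.Data.SqlClient

class Utf8WithNewProvider
{
    static void Main()
    {
        // Same placeholder connection string and hypothetical dbo.Samples table as before.
        using (var conn = new SqlConnection("Server=.;Database=Utf8Demo;Integrated Security=true"))
        using (var insert = new SqlCommand("INSERT INTO dbo.Samples (Value) VALUES (@value);", conn))
        {
            conn.Open();
            // With a provider that understands UTF-8 collations, a VarChar parameter
            // against a UTF-8-collated column keeps the extended characters.
            insert.Parameters.Add("@value", SqlDbType.VarChar, 100).Value = "Łódź, Ærøskøbing, 東京";
            insert.ExecuteNonQuery();
        }
    }
}
```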
Another novelty to be aware of (not so much for .NET but for ODBC/OLE DB) is that Windows 10 now allows changing the system locale to UTF-8 as well, which plays well with the updated providers. At the time of writing, the Windows 10 UTF-8 system locale is marked as Beta, but it is functioning well.
I love how this well-intentioned spirit of sharing leads to nobody understanding anything anymore, since most of what gets shared is actually MIS-understandings.
> The moral of the story is that — as with all strings in .NET applications — they should be treated as Unicode in all scenarios, which means using NChar and NVarChar data types when reading from and writing to SQL Server.
I don’t even know what you think you mean by the claim that strings “should be treated as Unicode”. All strings in .NET have always been and are still UTF-16, but for almost all intents and purposes, you should treat them as character strings, and not think about encoding (which bytes are used to represent the characters that make up the string) at all.
The nvarchar, nchar and ntext (equivalent to nvarchar(max)) types are also UTF-16. Until recently I believed Unicode had only one character set (hence *uni*, although Unicode is of course versioned and the character set does expand with new ones in new versions, such as emojis), but it turns out there is something called “supplementary characters” that are supported, but not enabled by default, even with the latest server (Seattle/2019, https://en.wikipedia.org/wiki/History_of_Microsoft_SQL_Server#Versions).
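To illustrate what “supplementary characters” means on the .NET side (a trivial sketch, nothing from the original discussion): such characters sit outside the Basic Multilingual Plane and therefore take two UTF-16 code units, a surrogate pair.

```csharp
using System;

class SupplementaryDemo
{
    static void Main()
    {
        // U+1F600 (😀) is a supplementary character: one Unicode code point,
        // but two UTF-16 code units (a surrogate pair) in a .NET string.
        string emoji = "😀";
        Console.WriteLine(emoji.Length);                     // 2 (UTF-16 code units)
        Console.WriteLine(char.ConvertToUtf32(emoji, 0));    // 128512 (decimal for U+1F600)
    }
}
```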
Anyway, what you did should work fine with older versions of MSSQL as well, since you’re using UTF-16 and not UTF-8. I’ve seen people claim that this is better for perf than using UTF-8 since it provides type affinity (your SQL Server-stored strings have the same binary representation as your client running in the .NET CLR, hence the provider does not have to perform any conversion). I don’t believe it though, unless your text data is mostly non-ascii characters at least, because it leads to much larger rows. Over time, as data volume grows, indexes grow larger and so on, I think using UTF-8 (in which any ascii character plus some more uses only a single byte) will typically perform much better than UTF-16 despite the overhead involved in conversion.
Anyway, you are correct that UTF-8 support is new in MSSQL 2019/Seattle, but you have to use char and varchar (and text) with appropriate *collation* in order to use it, NOT nvarchar, nchar, ntext.
https://docs.microsoft.com/en-us/sql/relational-databases/collations/collation-and-unicode-support?view=sql-server-ver16#utf8
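As a concrete illustration of that last point, this is roughly what opting into UTF-8 looks like (the table and column names are made up; Latin1_General_100_CI_AS_SC_UTF8 is one of the UTF-8 collations that ship with SQL Server 2019):

```csharp
class Utf8Ddl
{
    // DDL sketch: UTF-8 storage is chosen per column via a *_UTF8 collation on
    // char/varchar; nchar/nvarchar/ntext stay UTF-16 regardless of collation.
    public const string CreateTable = @"
        CREATE TABLE dbo.Utf8Demo
        (
            Id          int IDENTITY PRIMARY KEY,
            Description varchar(100) COLLATE Latin1_General_100_CI_AS_SC_UTF8
        );";
}
```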
I’m not sure what point you’re trying to make. I said that if you use UTF-8 collation with SQL Server 2019, you should treat all character strings like UTF-16, otherwise they won’t be persisted correctly. That’s all there is to this post. I’m more than happy to speak at length about the internals of collations (I recently presented a session at DataGrillen about data types), but on this post you are reading way too much into what I wrote.