Bulk insert issue with UTF-8 fixed-width file format


Fellow Canadian Doran Douglas brought this issue to my attention recently, and I wanted to share it with you as well.

Let’s say you have a file in UTF-8 format. What this means is that some characters are encoded as a single byte, while others take two or more.

Where this becomes problematic is that a fixed-width file has fields that are, well, fixed in size. If a Unicode character requires more than one byte, it’s going to cry havoc and let slip the dogs of truncation.

Take a look at this sample:
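Picture a fixed-width file along these lines, with the country name padded to 24 characters, followed by the ISO 3166 numeric and alpha-3 codes (the fourth row here stands in for any line without multi-byte characters):

```text
Curaçao                 531CUW
Canada                  124CAN
Åland Islands           248ALA
Mexico                  484MEX
```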

Notice that the first field (the Country name) is a fixed 24 characters, and the two code fields are three characters in length each.

Also note that the first and third lines have Unicode characters: the ç and the Å specifically.

Here’s what happens if you use BULK INSERT or the bcp command line tool:
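The load itself might look something like this (the table name, file path, and format file are assumptions; a fixed-width import needs a format file that defines each field's offsets, and the 65001 code page tells SQL Server the file is UTF-8, supported from SQL Server 2014 SP2 onward):

```sql
BULK INSERT dbo.Countries
FROM 'C:\temp\countries.txt'
WITH (
    FORMATFILE = 'C:\temp\countries.fmt',
    CODEPAGE = '65001' -- UTF-8
);
```

The catch is that the widths in the format file are counted in bytes, which is exactly where the trouble starts.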

The fixed-width parser has been told that the Country name is 24 characters wide, so it brings in the eight bytes of Curaçao followed by 16 bytes of padding (the ç takes up two bytes in its UTF-8 encoding).

Then the next field of three characters is brought in, comprising a space followed by 53. Finally, the third field is brought in as 1CU. The parser then hits the line break, so it imports the leftover W into its own row as the Country name, with nothing in the next two fields. Following another line break, Canada is imported successfully, along with its two code fields.

In the case of Åland Islands, the problem repeats itself and only portions of the next two fields are imported correctly, with the final A ending up on its own line. The final line is imported successfully again.
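Putting that all together, the imported rows end up looking something like this (column names are illustrative):

```text
CountryName               NumericCode  Alpha3Code
------------------------  -----------  ----------
Curaçao                    53          1CU
W
Canada                    124          CAN
Åland Islands              24          8AL
A
Mexico                    484          MEX
```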

Depending on the rules for import, this is the best case scenario if the source file is fixed-width and cannot be modified. Other variations result in much worse imports.

So how do we fix this?

The first and best option is to have the source file correctly formatted. That is, if multi-byte characters appear in only some of the lines, the file generator should compensate by adjusting the trailing spaces so that every line comes out to the same number of bytes, rather than the same number of characters, if it can.
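If the file generator happens to be SQL Server itself, one way to pad by bytes rather than characters is DATALENGTH(), which counts bytes when the column is a varchar with a UTF-8 collation (SQL Server 2019 and later); the table and column names here are assumptions:

```sql
-- Pad each name to 24 *bytes*, not 24 characters.
-- Names containing multi-byte characters get fewer trailing spaces.
SELECT  c.CountryName
        + REPLICATE(' ', 24 - DATALENGTH(c.CountryName))
        + c.NumericCode
        + c.Alpha3Code AS FixedWidthLine
FROM    dbo.Countries AS c;
```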

In Doran’s case, this was not possible because he had no control over the source file. Instead, he and I independently came up with the solution to import the entire file into a single column in SQL Server using the bulk loader, and then parse the fields based on length using Transact-SQL.

Here are three different solutions that I came up with. Because I wanted to compare the three approaches, I used a staging table:
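A sketch of the staging-table approach, using SUBSTRING() as in the first solution (the table, column, and file names here are illustrative):

```sql
-- Stage each raw line of the file in a single column.
CREATE TABLE dbo.CountryStaging (
    RawLine nvarchar(100) NOT NULL
);

-- Split on line breaks only, so each full line
-- lands in RawLine intact.
BULK INSERT dbo.CountryStaging
FROM 'C:\temp\countries.txt'
WITH (
    CODEPAGE = '65001',
    ROWTERMINATOR = '0x0a'
);

-- SUBSTRING() works in characters, not bytes,
-- so the multi-byte problem disappears.
INSERT INTO dbo.Countries (CountryName, NumericCode, Alpha3Code)
SELECT  RTRIM(SUBSTRING(RawLine, 1, 24)),
        SUBSTRING(RawLine, 25, 3),
        SUBSTRING(RawLine, 28, 3)
FROM    dbo.CountryStaging;
```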

Doran’s own solution was similar to my first solution using SUBSTRING(), but without the staging table.

Have you been bitten by Unicode/UTF-8? Tell me your horror stories and successes here in the comments.

