Bulk insert issue with UTF-8 fixed-width file format

Fellow Canadian Doran Douglas brought this issue to my attention recently, and I wanted to share it with you as well.

Let’s say you have a file in UTF-8 format. What this means is that some of the characters will be single-byte, and some may be more than that.

Where this becomes problematic is that a fixed-width file has fields that are, well, fixed in size. If a Unicode character requires more than one byte, it’s going to cry havoc and let slip the dogs of truncation.

Take a look at this sample:

Curaçao                 531CUW
Canada                  124CAN
Åland Islands           248ALA
United States of America840USA

Notice that the first field (the Country name) is a fixed 24 characters, and the two code fields are three characters in length each.

Also note that the first and third lines have Unicode characters: the ç and the Å specifically.

Here’s what happens if you use BULK INSERT or the bcp command line tool:

The fixed-width parser has been told that the Country name is 24 characters wide, so it brings in the eight bytes of Curaçao followed by 16 bytes of empty string (the ç takes up two bytes because it is Unicode).

Then the next field of three characters is brought in, comprising an empty space, and 53. Finally, the third field is brought in as 1CU. The parser knows there is a line break, so it imports the W into its own field as Country name, and nothing in the next two fields. Following another line break, Canada is imported successfully, along with its two code fields.

In the case of Åland Islands, the problem repeats itself and only portions of the next two fields are imported correctly, with the final A ending up on its own line. The final line is imported successfully again.

Curaçao                  |  53 | 1CU
W                        |     |
Canada                   | 124 | CAN
Åland Islands            |  24 | 8AL
A                        |     |
United States of America | 840 | USA

Depending on the rules for import, this is the best case scenario if the source file is fixed-width and cannot be modified. Other variations result in much worse imports.

So how do we fix this?

The first and best option is to have the source file correctly formatted. That is, if there are Unicode characters in only some of the lines, the file generator should compensate by adding or removing trailing spaces accordingly, if it can.

In Doran’s case, this was not possible because he had no control over the source file. Instead, he and I independently came up with the solution to import the entire file into a single column in SQL Server using the bulk loader, and then parse the fields based on length using Transact-SQL.

Here are three different solutions that I came up with. Because I tried three different ways, I used a staging table:

USE tempdb;

DROP TABLE IF EXISTS CountryStaging;

-- Single column for all three imported fields
CREATE TABLE CountryStaging (Line NVARCHAR(512));

-- Into staging table
BULK INSERT CountryStaging
FROM 'C:\Temp\CountryTestDataFixedWidth.txt'
WITH (ROWTERMINATOR = '\n');
GO

-- First method: SELECT with SUBSTRING()
SELECT
    SUBSTRING([Line], 0, 25) AS CountryName,
    SUBSTRING([Line], 25, 3) AS CountryCodeNumeric,
    SUBSTRING([Line], 28, 3) AS CountryCodeAlpha
FROM
    dbo.CountryStaging;

-- Second method: SELECT with LEFT and RIGHT
SELECT
    CAST(LEFT([Line], 24) AS NVARCHAR(255)) AS CountryName,
    CAST(LEFT(RIGHT([Line], (6)), (3)) AS CHAR(3)) AS CountryCodeNumeric,
    CAST(RIGHT([Line], (3)) AS CHAR(3)) AS CountryCodeAlpha
FROM
    dbo.CountryStaging;

-- Third method: Add in persisted computed columns
ALTER TABLE dbo.CountryStaging ADD CountryName AS (LEFT([Line], 24)) PERSISTED;
ALTER TABLE dbo.CountryStaging ADD CountryCodeNumeric AS (LEFT(RIGHT([Line], (6)), (3))) PERSISTED;
ALTER TABLE dbo.CountryStaging ADD CountryCodeAlpha AS (RIGHT([Line], (3))) PERSISTED;
GO

-- SELECT with computed columns
SELECT
    CountryName,
    CountryCodeNumeric,
    CountryCodeAlpha
FROM
    CountryStaging;

DROP TABLE IF EXISTS CountryStaging;

Doran’s own solution was similar to my first solution using SUBSTRING(), but without the staging table.

Have you been bitten by Unicode/UTF-8? Tell me your horror stories and successes here in the comments.

Photo by JJ Ying on Unsplash.