Fellow Canadian Doran Douglas brought this issue to my attention recently, and I wanted to share it with you as well.
Let’s say you have a file in UTF-8 format. What this means is that some of the characters will be single-byte, and some may be more than that.
Where this becomes problematic is that a fixed-width file has fields that are, well, fixed in size. If a Unicode character requires more than one byte, it’s going to cry havoc and let slip the dogs of truncation.
Take a look at this sample:
Åland Islands 248ALA
United States of America840USA
Notice that the first field (the Country name) is a fixed 24 characters, and the two code fields are three characters in length each.
Also note that the first and third lines have Unicode characters: the
ç and the
Here’s what happens if you use
BULK INSERT or the
bcp command line tool:
The fixed-width parser has been told that the Country name is 24 characters wide, so it brings in the eight bytes of
Curaçao followed by 16 bytes of empty string (the
ç takes up two bytes because it is Unicode).
Then the next field of three characters is brought in, comprising an empty space, and
53. Finally, the third field is brought in as
1CU. The parser knows there is a line break, so it imports the
W into its own field as Country name, and nothing in the next two fields. Following another line break, Canada is imported successfully, along with its two code fields.
In the case of
Åland Islands, the problem repeats itself and only portions of the next two fields are imported correctly, with the final
A ending up on its own line. The final line is imported successfully again.
Curaçao | 53 | 1CU
W | |
Canada | 124 | CAN
Åland Islands | 24 | 8AL
A | |
United States of America | 840 | USA
Depending on the rules for import, this is the best case scenario if the source file is fixed-width and cannot be modified. Other variations result in much worse imports.
So how do we fix this?
The first and best option is to have the source file correctly formatted. That is, if there are Unicode characters in only some of the lines, the file generator should compensate by adding or removing trailing spaces accordingly, if it can.
In Doran’s case, this was not possible because he had no control over the source file. Instead, he and I independently came up with the solution to import the entire file into a single column in SQL Server using the bulk loader, and then parse the fields based on length using Transact-SQL.
Here are three different solutions that I came up with. Because I tried three different ways, I used a staging table:
DROP TABLE IF EXISTS CountryStaging;
-- Single column for all three imported fields
CREATE TABLE CountryStaging (Line NVARCHAR(512));
-- Into staging table
BULK INSERT CountryStaging
WITH (ROWTERMINATOR = '\n');
-- First method: SELECT with SUBSTRING()
SUBSTRING([Line], 0, 25) AS CountryName,
SUBSTRING([Line], 25, 3) AS CountryCodeNumeric,
SUBSTRING([Line], 28, 3) AS CountryCodeAlpha
-- Second method: SELECT with LEFT and RIGHT
CAST(LEFT([Line], 24) AS NVARCHAR(255)) AS CountryName,
CAST(LEFT(RIGHT([Line], (6)), (3)) AS CHAR(3)) AS CountryCodeNumeric,
CAST(RIGHT([Line], (3)) AS CHAR(3)) AS CountryCodeAlpha
-- Third method: Add in persisted computed columns
ALTER TABLE dbo.CountryStaging ADD CountryName AS (LEFT([Line], 24)) PERSISTED;
ALTER TABLE dbo.CountryStaging ADD CountryCodeNumeric AS (LEFT(RIGHT([Line], (6)), (3))) PERSISTED;
ALTER TABLE dbo.CountryStaging ADD CountryCodeAlpha AS (RIGHT([Line], (3))) PERSISTED;
-- SELECT with computed columns
DROP TABLE IF EXISTS CountryStaging;
Doran’s own solution was similar to my first solution using
SUBSTRING(), but without the staging table.
Have you been bitten by Unicode/UTF-8? Tell me your horror stories and successes here in the comments.