Ramblings

Delphi Programming Observations

Friday, July 31, 2009

SQL Function to Get NVARCHAR FROM UTF-8 Stored in VARCHAR

I (am forced to) use StarTeam at work. I think it’s terrible as a Source Code Management system, but I think the Customer Relationship Management aspects (Change Requests, Tasks, Work Records, etc.) work ok. The java GUI client, however, is slow and lacking robust reporting. We certainly have a lot of information stored in StarTeam, so it would take a really good reason to change to another tool.

So, we have a Delphi program to directly query the SQL views in the MS SQL Server database to perform reporting how we want.  Early on, we ran into some strange characters coming out of the database (things like the leaning quotes that MS Word likes to automatically change).  So I found out that StarTeam was storing UTF-8 in the VARCHAR and TEXT fields, basically treating them like byte arrays.  In Delphi 2009, I can just use UTF8ToUnicodeString( RawByteString(queryResultString) ); and the output seems to match what is displayed in StarTeam.

Now I’m learning ASP.NET MVC and using LINQ to SQL to read from the StarTeam SQL views. After some searching, the only information that I found was advice against storing UTF-8 in VARCHAR fields, but of course I can’t change the database.

I didn’t want to have to decode the UTF-8 any time I wanted to read the fields from the Model, and I actually did not find a way to decode them properly in C# using LINQ to SQL.

So in my searching, I came across some JavaScript to change ASCII to UTF-8 and back. I then realized that I might be able to translate that to TSQL, to have my own SQL View return NVARCHARs to LINQ to SQL.

And it seems to work:

CREATE FUNCTION dbo.UTF8_TO_NVARCHAR(@in VarChar(MAX))
	RETURNS NVarChar(MAX)
AS
BEGIN
	DECLARE @out NVarChar(MAX), @i int, @c int, @c2 int, @c3 int, @nc int

	SELECT @i = 1, @out = ''
	
	WHILE (@i <= Len(@in))
	BEGIN
		SET @c = Ascii(SubString(@in, @i, 1))

		IF (@c < 128)
		BEGIN
			SET @nc = @c
			SET @i = @i + 1
		END
		ELSE IF (@c > 191 AND @c < 224)
		BEGIN
			SET @c2 = Ascii(SubString(@in, @i + 1, 1))
			
			SET @nc = (((@c & 31) * 64 /* << 6 */) | (@c2 & 63))
			SET @i = @i + 2
		END
		ELSE
		BEGIN
			SET @c2 = Ascii(SubString(@in, @i + 1, 1))
			SET @c3 = Ascii(SubString(@in, @i + 2, 1))
			
			SET @nc = (((@c & 15) * 4096 /* << 12 */) | ((@c2 & 63) * 64 /* << 6 */) | (@c3 & 63))
			SET @i = @i + 3
		END

		SET @out = @out + NChar(@nc)
	END
	RETURN @out
END
GO

I used VARCHAR(Max) so that this Function will work with TEXT fields, as well as VARCHAR fields. I believe VARCHAR(Max) was introduced in SQL Server 2005; it allows you to have a VARCHAR variable longer than 8000 characters.

posted by Jason at 10:15 pm  

3 Responses to “SQL Function to Get NVARCHAR FROM UTF-8 Stored in VARCHAR”

  1. Massterw says:

    Thanks for your post!!

  2. Wooseung Kim says:

    I’m Korean. Recently a Korean girl archived the gold medal of female single figure skating in Olympic.
    I am proud of it. ;-)

    Can I and my company use and modify your T-SQL funcion for commercial works with license free?
    I have found one just like your function.

    I am waiting for your reply.

    • Jason says:

      Yes, you may use it for commercial works. You can consider it licensed under the MIT license