I get questions like this one quite often:
- I’m reading Unicode strings from SQL Server and I need to convert to utf-8. How do I do it? I’ve tried everything and nothing works. I’m using classic ASP.
First, let me assure you, the Chilkat Charset component does work. If your input to ConvertData is correct in the sense that you are passing bytes that represent text according to the character encoding "FromCharset", you will get output bytes in the "ToCharset" encoding. If you pass garbage in, you will get garbage out.
There are 2 steps to solving your problem. The first step is to verify that what you are passing to the charset converter is indeed correct. The charset component provides last-input and last-output properties to allow you to visually inspect the input and output bytes. You’ll first need to turn on the "SaveLast" property. For example:
Dim cconv As New ChilkatCharset2
cconv.UnlockComponent "anything for 30-day trial"
cconv.FromCharset = "utf-8"
cconv.ToCharset = "windows-1252"
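' Keep a copy of the last input and output so they can be inspected afterward.
cconv.SaveLast = 1
' Run this code 3 times, once for each file. Only the input bytes are
' inspected below, so FromCharset is left unchanged between runs.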
Dim data As Variant
data = cconv.ReadFile("utf8.txt")
'data = cconv.ReadFile("unicode.txt")
'data = cconv.ReadFile("1252.txt")
Dim out As Variant
out = cconv.ConvertData(data)
Text1.Text = cconv.LastInputAsHex
Let’s look at the contents of LastInputAsHex for each file (utf-8, Unicode, and windows-1252).
Here's the utf-8 data:
EFBB BFC3 8D73 6C65 6E73 6B61 3A20 C389
6720 6765 7420 6574 69C3 B020 676C 6572
20C3 A16E 20C3 BE65 7373 2061 C3B0 206D
6569 C3B0 6120 6D69 672E
Here's the Unicode data:
FFFE CD00 7300 6C00 6500 6E00 7300 6B00
6100 3A00 2000 C900 6700 2000 6700 6500
7400 2000 6500 7400 6900 F000 2000 6700
6C00 6500 7200 2000 E100 6E00 2000 FE00
6500 7300 7300 2000 6100 F000 2000 6D00
6500 6900 F000 6100 2000 6D00 6900 6700
2E00
Here's the Windows-1252 data:
CD73 6C65 6E73 6B61 3A20 C967 2067 6574
2065 7469 F020 676C 6572 20E1 6E20 FE65
7373 2061 F020 6D65 69F0 6120 6D69 672E
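As a cross-check, the three dumps above can be decoded to confirm they all hold the same text. The following is a minimal sketch in Python rather than VB, chosen only because it runs without the Chilkat component. It uses Python's built-in codecs, not the Chilkat API, and the variable names are illustrative.

```python
# Hex dumps copied from above; all whitespace is stripped before parsing.
UTF8_HEX = """
EFBB BFC3 8D73 6C65 6E73 6B61 3A20 C389
6720 6765 7420 6574 69C3 B020 676C 6572
20C3 A16E 20C3 BE65 7373 2061 C3B0 206D
6569 C3B0 6120 6D69 672E
"""
UTF16_HEX = """
FFFE CD00 7300 6C00 6500 6E00 7300 6B00
6100 3A00 2000 C900 6700 2000 6700 6500
7400 2000 6500 7400 6900 F000 2000 6700
6C00 6500 7200 2000 E100 6E00 2000 FE00
6500 7300 7300 2000 6100 F000 2000 6D00
6500 6900 F000 6100 2000 6D00 6900 6700
2E00
"""
CP1252_HEX = """
CD73 6C65 6E73 6B61 3A20 C967 2067 6574
2065 7469 F020 676C 6572 20E1 6E20 FE65
7373 2061 F020 6D65 69F0 6120 6D69 672E
"""


def from_hex(dump: str) -> bytes:
    """Turn a whitespace-separated hex dump into raw bytes."""
    return bytes.fromhex("".join(dump.split()))


utf8_bytes = from_hex(UTF8_HEX)
utf16_bytes = from_hex(UTF16_HEX)
cp1252_bytes = from_hex(CP1252_HEX)

# "utf-8-sig" drops the EF BB BF preamble; "utf-16" reads the FF FE
# preamble to pick the byte order and then drops it.
text_from_utf8 = utf8_bytes.decode("utf-8-sig")
text_from_utf16 = utf16_bytes.decode("utf-16")
text_from_cp1252 = cp1252_bytes.decode("cp1252")

print(text_from_utf8)
print(len(utf8_bytes), len(utf16_bytes), len(cp1252_bytes))
print(text_from_utf8 == text_from_utf16 == text_from_cp1252)
```

The byte counts differ for the same characters: only the single-byte encoding has exactly one byte per character.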
Some characteristics of the encodings are:
- If your text is in a European language, Unicode (here meaning UTF-16, the two-byte-per-character encoding Windows calls "Unicode") is easily spotted because every (or almost every) 2nd byte is a 0 byte. If there are no NULL bytes, it probably isn’t Unicode.
- The FF FE preamble signifies Unicode (little-endian UTF-16), and the EF BB BF preamble signifies utf-8. It is possible to have Unicode/utf-8 without the preambles.
- Usually, in utf-8 text for a European language, non-us-ascii characters are two bytes, whereas us-ascii characters are a single byte with the MSB clear (i.e. a 7-bit character). The two-byte utf-8 characters for the most part represent the European characters having diacritics (accent marks). The 1st byte will typically have only a few common values, such as 0xC3.
- Windows-125x and iso-8859-x charsets use a single byte per character, so the input byte count equals the exact number of characters in the text.
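These rules of thumb can be turned into a rough check. The following is a minimal Python sketch (not VB, and not part of the Chilkat API). The function name `guess_encoding`, the returned labels, and the 90% threshold for zero bytes are all assumptions made for illustration. It is a heuristic for European-language text only, not a real charset detector.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Rough guess based on preambles, NULL bytes, and utf-8 validity."""
    # 1. Preambles are the strongest hint when they are present.
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16le"

    # 2. UTF-16LE European text: (almost) every 2nd byte is 0.
    if len(data) >= 2 and len(data) % 2 == 0:
        second_bytes = data[1::2]
        if second_bytes.count(0) >= 0.9 * len(second_bytes):
            return "utf-16le"

    # 3. Bytes that are not valid utf-8 are most likely a single-byte
    #    charset such as windows-125x or iso-8859-x.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "single-byte"

    # 4. Valid utf-8 with every MSB clear is plain us-ascii.
    if all(b < 0x80 for b in data):
        return "us-ascii"
    return "utf-8"


sample = "\u00c9g get eti\u00f0 gler"
for name in ("utf-8", "utf-16-le", "cp1252"):
    print(name, "->", guess_encoding(sample.encode(name)))
```

A check like this only suggests what FromCharset should be. Looking at the hex yourself is still the more reliable test.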
So… back to the 1st step of solving your character conversion problem: Does the LastInputAsHex produce output that makes sense with your FromCharset character encoding?
The information I’ve provided should help you answer that question.
The 2nd step is to use the output correctly. You must be sure to avoid implicit conversions in whatever scripting or programming language you’re using, and you have to understand the display capabilities if you intend to display the characters in an application’s form or a web page.
The Charset.ConvertData method returns a byte array, or in the case of the ActiveX, a Variant containing a byte array. Avoid assigning this to a "string", because that will most likely involve an implicit ANSI-to-Unicode conversion, and if your output is non-ANSI, your text will be corrupted. If you are writing the output to a file, make sure the file was opened for binary writing. If you are in ASP and writing the Response, understand how it works in conjunction with the HTML’s charset meta tag, the ASP code page, etc. More information about that is here: Demystifying ASP Code Pages, Response.Write, Response.BinaryWrite, Strings, and Charsets
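The effect of an implicit ANSI conversion is easy to reproduce outside of ASP. The following is a minimal Python sketch, not VB and not the Chilkat API. It assumes "ANSI" means windows-1252, and the file name is arbitrary. It shows utf-8 bytes being corrupted when treated as ANSI text, and surviving intact when kept as bytes and written in binary mode.

```python
import os
import tempfile

text = "\u00c9g get eti\u00f0 gler"
utf8_bytes = text.encode("utf-8")

# Wrong: interpreting utf-8 bytes as ANSI (windows-1252) text, which is
# roughly what an implicit ANSI-to-Unicode string conversion does.
mangled = utf8_bytes.decode("cp1252")
print(mangled)  # each accented character shows up as two characters

# Right: keep the converted output as bytes and write it in binary mode.
path = os.path.join(tempfile.mkdtemp(), "out-utf8.txt")
with open(path, "wb") as f:
    f.write(utf8_bytes)

# Reading the file back in binary mode returns exactly the same bytes.
with open(path, "rb") as f:
    round_trip = f.read()

print(round_trip == utf8_bytes)
print(round_trip.decode("utf-8") == text)
```

The same principle applies to the ASP Response: the converted bytes have to reach the browser unchanged, and the declared charset has to match them.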