October 30, 2006

Displaying Characters from any Language in ASP

Download this ASP example and text data files from:
http://www.chilkatsoft.com/download/aspUtf8.zip

The ASP response should consist of characters in the utf-8 encoding.

utf-8, a multibyte (variable-length) encoding of Unicode, is the de facto
standard charset for Unicode web pages. Pages that may need to display
characters in any language, such as a major search engine's results page,
should use utf-8.
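To make "multibyte" concrete, here is a small sketch in Python, used only because it makes the bytes easy to inspect; it is not part of the ASP example. The helper name `utf8_hex` and the sample characters are illustrative assumptions.

```python
def utf8_hex(ch: str) -> str:
    """Return the utf-8 bytes of a single character as an uppercase hex string."""
    return ch.encode("utf-8").hex().upper()


# us-ascii characters take 1 byte; other characters take 2 or more bytes.
for ch in ["s", "Í", "€"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)} ({len(encoded)} byte(s))")
```

The character Í (U+00CD) becomes the two bytes C3 8D, which is the byte pair discussed in the example script.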

The problem can be broken up into these parts:

1. Where is your text data? Are you querying Unicode text from a database?
   Are you reading ANSI text files? How do you get the data from its source
   into string variables that can be manipulated in ASP?
2. Once the data is properly read from its source, how do you emit utf-8 bytes
   to the response output?

Chilkat provides components that help in both steps. ChilkatUtil.dll is free; ChilkatCharset2.dll is not.
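Independent of ASP and the Chilkat components, the two steps amount to "decode the source bytes using the charset they are really in, then encode the resulting text as utf-8". A minimal Python sketch of that idea follows; the function name `to_utf8` and the sample word are assumptions made for illustration.

```python
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Step 1: decode the source bytes. Step 2: encode the text as utf-8."""
    text = raw.decode(source_charset)  # bytes -> Unicode string
    return text.encode("utf-8")        # Unicode string -> utf-8 bytes


# The same word stored in two different source encodings.
ansi_bytes = "Íslenska".encode("cp1252")         # 1 byte/char (Windows-1252)
unicode_bytes = "Íslenska".encode("utf-16-le")   # 2 bytes/char

# Both sources produce identical utf-8 output once decoded correctly.
print(to_utf8(ansi_bytes, "cp1252").hex().upper())
print(to_utf8(unicode_bytes, "utf-16-le").hex().upper())
```

The key design point is that the source charset must be known (or discovered by examining the bytes) before any conversion is attempted.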

<%
' This example uses three text files:
' text_unicode.txt: contains 2-byte/char Unicode characters.
' text_utf8.txt: contains utf-8 encoded text.
' text_windows1252.txt: contains ANSI (1 byte/char) text (Windows-1252 encoding).
'
' Your data may come from a database. It doesn't matter. The point here is
' to show how to analyze the bytes once you've loaded your data, so that you
' understand what you have.

' ASP's FileSystemObject provides the ability to read text files, but not binary files.
' What is text and what is binary? Technically, text_unicode.txt and text_utf8.txt are
' "binary", and text_windows1252.txt is "text".

' Let's read each of them with the FileSystemObject anyway to see what
' we get. Do the bytes get mangled when reading as text? We'll see...
' (The second argument to OpenTextFile, 1, means ForReading.)
Set fs=Server.CreateObject("Scripting.FileSystemObject")
Set f=fs.OpenTextFile(Server.MapPath("/data/text_unicode.txt"), 1)
txtUnicode = f.ReadAll
f.Close
Set f=fs.OpenTextFile(Server.MapPath("/data/text_utf8.txt"), 1)
txtUtf8 = f.ReadAll
f.Close
Set f=fs.OpenTextFile(Server.MapPath("/data/text_windows1252.txt"), 1)
txtAnsi = f.ReadAll
f.Close

' Use ChilkatUtil.CkData to examine the bytes (by calling GetHex to get
' a hex-encoded string of the raw byte data).
' ChilkatUtil.dll is included with most downloads, including
' the Chilkat Charset ActiveX (http://www.chilkatsoft.com/download/CharsetActiveX.msi)
' IMPORTANT: Download the latest version of the Chilkat Charset ActiveX because
' the ChilkatUtil.CkData.GetHex method was added on 30-Oct-2006 for this example.
Set ckd = Server.CreateObject("ChilkatUtil.CkData")
ckd.LoadBinary txtUnicode
Response.Write "incorrect: " & ckd.GetHex() & "<br>"
' Displays this: FF00FE00CD000000730000006C000000650000006E00000073000000…
' This is not good. We'll have to change the way the data is read from the file.
' OpenTextFile/ReadAll is treating each individual byte as an ANSI char and
' converting it to 2-byte/char Unicode. Even the preamble bytes (FF FE) are assumed
' to be ANSI chars.

ckd.LoadBinary txtUtf8
Response.Write "incorrect: " & ckd.GetHex() & "<br>"
' Displays this: EF00BB00BF00C3008D0073006C0065006E0073006B0061002000
' The same problem happens here. OpenTextFile/ReadAll treats each byte as an ANSI char,
' including the utf-8 preamble bytes (EF BB BF). The us-ascii bytes are converted
' to Unicode correctly. However, the 8-bit non-us-ascii characters (i.e. characters with
' diacritics, such as European letters with an acute, grave, or umlaut) are represented
' in utf-8 by 2 bytes/char, and in this case each byte is incorrectly assumed to be a single
' ANSI character and converted to Unicode. For example, the utf-8 bytes 0xC3 0x8D represent
' a single character, which is the byte 0xCD in Windows-1252. Our code has incorrectly
' converted 0xC3 0x8D into two Unicode characters.

ckd.LoadBinary txtAnsi
Response.Write "correct: " & ckd.GetHex() & "<br>"
' Displays this: CD0073006C0065006E0073006B00610020002F0020004900
' This is OK. We can see that each Unicode character is represented correctly in the hex string,
' and the length of the hex string makes sense.
' Our experiment shows that OpenTextFile/ReadAll, as called here with its default format,
' can only read ANSI text files correctly.

' If you are fetching text data from a database, or from some other source, your first
' step should be to examine the bytes. Did the bytes get mucked up in the process?
' Use the same procedure as described here: load the data into a CkData by calling
' LoadBinary and then examine the bytes with GetHex.

' How would we get the text from text_unicode.txt and text_utf8.txt into an ASP string?
' Use ChilkatCharset2 to read and convert:
Set cc = Server.CreateObject("ChilkatCharset2.ChilkatCharset2")
cc.UnlockComponent "anything for 30-day trial"

' Read the raw bytes, then convert them from utf-8 to a Unicode string.
fileData = cc.ReadFile(Server.MapPath("/data/text_utf8.txt"))
cc.FromCharset = "utf-8"
textUtf8 = cc.ConvertToUnicode(fileData)
ckd.LoadBinary textUtf8
Response.Write "correct: " & ckd.GetHex() & "<br>"

' Converting from Unicode to Unicode here simply removes the preamble.
fileData = cc.ReadFile(Server.MapPath("/data/text_unicode.txt"))
cc.FromCharset = "unicode"
textUnicode = cc.ConvertToUnicode(fileData)
ckd.LoadBinary textUnicode
Response.Write "correct: " & ckd.GetHex() & "<br>"

' Now that we have the input loaded into a string correctly, how do we emit
' utf-8 bytes?
Response.Write "<p>Utf-8 Text:<br>"
' Use BinaryWrite to emit utf-8 bytes. (For the browser to decode them
' correctly, the page must also declare its charset as utf-8.)
cc.ToCharset = "utf-8"
Response.BinaryWrite cc.ConvertFromUnicode(textUnicode)
Response.Write "</p>"
%>
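The hex dumps in the script above can be reproduced outside ASP. The following Python sketch models the "one byte becomes one character" behavior and prints the same byte patterns. It is an illustration under stated assumptions, not a description of how the FileSystemObject is implemented: latin-1 stands in for the ANSI code page (it agrees with the dumps above for these particular bytes), the sample word is the one the dumps spell out, and the function names are hypothetical.

```python
def read_as_ansi_text(raw: bytes) -> str:
    # Assumption: model the byte-per-character reading with latin-1, which maps
    # every byte 0x00-0xFF to the code point with the same value.
    return raw.decode("latin-1")


def hex_of_string(text: str) -> str:
    # ASP strings hold 2 bytes/char; utf-16-le reproduces that in-memory layout.
    return text.encode("utf-16-le").hex().upper()


sample = "Íslenska"
unicode_file = b"\xff\xfe" + sample.encode("utf-16-le")  # preamble FF FE
utf8_file = b"\xef\xbb\xbf" + sample.encode("utf-8")     # preamble EF BB BF
ansi_file = sample.encode("cp1252")                      # no preamble

# Reading every file as ANSI text: only the ANSI file survives intact.
print("incorrect:", hex_of_string(read_as_ansi_text(unicode_file)))
print("incorrect:", hex_of_string(read_as_ansi_text(utf8_file)))
print("correct:  ", hex_of_string(read_as_ansi_text(ansi_file)))

# Decoding with the real charset recovers the text (and drops the preamble).
print("correct:  ", hex_of_string(utf8_file.decode("utf-8-sig")))
print("correct:  ", hex_of_string(unicode_file.decode("utf-16")))
```

In the first two lines of output every source byte, including the preamble, is padded with 00, which is the pattern labeled "incorrect" in the ASP output. The last two lines match the ANSI result because the bytes were decoded with their real charset first.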
