Download this ASP example and text data files from:
http://www.chilkatsoft.com/download/aspUtf8.zip
The ASP response should be characters in the utf-8 encoding.
Utf-8, a multibyte encoding for Unicode, is the de-facto standard
charset for Unicode web pages. Pages that may need to display
characters in any language (such as a major search engine's results
page) use utf-8.
The problem can be broken up into these parts:
- Where is your text data? Are you querying Unicode text from a database?
Are you reading ANSI text files? How do you get the data from its source
into string variables that can be manipulated in ASP?
- Once the data is properly read from its source, how do you emit utf-8 bytes
to the response output?
Chilkat provides components that help in both steps. ChilkatUtil.dll is free; ChilkatCharset2.dll is not.
<%
' This example uses three text files:
' text_unicode.txt - contains 2 byte/char Unicode characters.
' text_utf8.txt - contains utf-8 encoded text.
' text_windows1252.txt - contains ANSI (1 byte/char) text (Windows-1252 encoding)
'
' Your data may come from a database. It doesn't matter. The point here is
' to show how to analyze the bytes once you've loaded your data to understand
' what you have.
' ASP provides the ability to read text files, but not binary files.
' What is text and what is binary? Technically, text_unicode.txt and text_utf8.txt are
' "binary", and text_windows1252.txt is "text".
' Let's read each of them with the FileSystemObject anyway to see what
' we get. Do the bytes get mangled when reading as text? We'll see…
Set fs=Server.CreateObject("Scripting.FileSystemObject")
Set f=fs.OpenTextFile(Server.MapPath("/data/text_unicode.txt"), 1)
txtUnicode = f.ReadAll
f.Close
Set f=fs.OpenTextFile(Server.MapPath("/data/text_utf8.txt"), 1)
txtUtf8 = f.ReadAll
f.Close
Set f=fs.OpenTextFile(Server.MapPath("/data/text_windows1252.txt"), 1)
txtAnsi = f.ReadAll
f.Close
' Use ChilkatUtil.CkData to examine the bytes (by calling GetHex to get
' a hex-encoded string of the raw byte data).
' The ChilkatUtil.dll is included with most downloads, including
' the Chilkat Charset ActiveX (http://www.chilkatsoft.com/download/CharsetActiveX.msi)
' IMPORTANT: Download the latest version of the Chilkat Charset ActiveX because
' the ChilkatUtil.CkData.GetHex method was added on 30-Oct-2006 for this example.
Set ckd = Server.CreateObject("ChilkatUtil.CkData")
ckd.LoadBinary txtUnicode
Response.Write "incorrect: " + ckd.GetHex() + "<br>"
' Displays this: FF00FE00CD000000730000006C000000650000006E00000073000000…
' This is not good. We'll have to change the way the data is read from the file.
' OpenTextFile/ReadAll is treating each individual byte as an ANSI char and
' converting it to 2-byte/char Unicode. Even the preamble bytes (FF FE) are assumed
' to be ANSI chars.
ckd.LoadBinary txtUtf8
Response.Write "incorrect: " + ckd.GetHex() + "<br>"
' Displays this: EF00BB00BF00C3008D0073006C0065006E0073006B0061002000
' The same problem happens here. Each byte is treated by OpenTextFile/ReadAll as an ANSI char,
' including the utf-8 preamble bytes (EF BB BF). The us-ascii bytes are correctly converted
' to Unicode. However, the 8-bit non-us-ascii characters (i.e. characters with diacritics,
' such as accented European language characters with an acute, grave, umlaut, etc.) are represented
' in utf-8 by 2 bytes/char, and in this case each byte is incorrectly assumed to be a single
' ANSI character and converted to Unicode. For example, 0xC3 0x8D represents a single character,
' the one represented by 0xCD in Windows-1252. Our code has incorrectly converted 0xC3 0x8D into two
' Unicode characters.
ckd.LoadBinary txtAnsi
Response.Write "correct: " + ckd.GetHex() + "<br>"
' Displays this: CD0073006C0065006E0073006B00610020002F0020004900
' This is OK. We can see that each Unicode character is represented correctly in the hex string,
' and the length of the hex string makes sense.
' Our experiment shows that OpenTextFile/ReadAll, as called above with the default
' format argument, can only read ANSI text files correctly.
' If you are fetching text data from a database, or from some other source, your first
' step should be to examine the bytes. Did the bytes get mucked up in the process?
' Use the same procedure as described here: load the data into a CkData by calling
' LoadBinary and then examine the bytes with GetHex.
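' A possible built-in option for the 2 byte/char Unicode file only (a sketch, not
' part of the original example): OpenTextFile takes an optional fourth "format"
' argument, and -1 (TristateTrue) opens the file as Unicode instead of ANSI.
' The variable name txtUnicodeFso is hypothetical. This does not help with utf-8 files.
Set f = fs.OpenTextFile(Server.MapPath("/data/text_unicode.txt"), 1, False, -1)
txtUnicodeFso = f.ReadAll
f.Close
' Examine the bytes the same way as before and compare them with the hex shown above.
ckd.LoadBinary txtUnicodeFso
Response.Write "TristateTrue: " + ckd.GetHex() + "<br>"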
' How would we get the text from text_unicode.txt and text_utf8.txt into an ASP string?
' Use ChilkatCharset2 to read and convert:
Set cc = Server.CreateObject("ChilkatCharset2.ChilkatCharset2")
cc.UnlockComponent "anything for 30-day trial"
fileData = cc.ReadFile(Server.MapPath("/data/text_utf8.txt"))
cc.FromCharset = "utf-8"
textUtf8 = cc.ConvertToUnicode(fileData)
ckd.LoadBinary textUtf8
Response.Write "correct: " + ckd.GetHex() + "<br>"
' The conversion from Unicode to Unicode is simply removing the preamble here…
fileData = cc.ReadFile(Server.MapPath("/data/text_unicode.txt"))
cc.FromCharset = "unicode"
textUnicode = cc.ConvertToUnicode(fileData)
ckd.LoadBinary textUnicode
Response.Write "correct: " + ckd.GetHex() + "<br>"
' Now that we have the input loaded into a string correctly, how do we emit
' utf-8 bytes?
Response.Write "<br>Utf-8 Text:<br>"
' Use BinaryWrite to emit utf-8 bytes
cc.ToCharset = "utf-8"
Response.BinaryWrite cc.ConvertFromUnicode(textUnicode)
Response.Write "<br>"
%>
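For comparison, here is a minimal sketch of the same two steps (read the data correctly, then emit utf-8) using only objects that ship with classic ASP and ADO rather than the Chilkat components. It is not part of the downloadable example. It assumes ADODB.Stream is available on the server, that it runs as a separate page, and that /data/text_utf8.txt is the utf-8 file used above. The variable names stm and textUtf8Ado are hypothetical.

```vbscript
<%
' Sketch only: uses built-in ASP/ADO objects, not the Chilkat components above.
' Ask ASP to convert strings to utf-8 when writing, and label the response as utf-8.
Response.CodePage = 65001
Response.Charset = "utf-8"

' Step 1: read the utf-8 file into a (Unicode) VBScript string.
Set stm = Server.CreateObject("ADODB.Stream")
stm.Type = 2                ' 2 = adTypeText
stm.Charset = "utf-8"       ' decode the file bytes as utf-8
stm.Open
stm.LoadFromFile Server.MapPath("/data/text_utf8.txt")
textUtf8Ado = stm.ReadText  ' reads the whole stream by default
stm.Close

' Step 2: write the string; with CodePage 65001 the output bytes are utf-8.
Response.Write "Utf-8 Text:<br>"
Response.Write textUtf8Ado
%>
```

Response.CodePage and Response.Charset are set first, before any output is written, so that the Content-Type header and the bytes that follow agree. As with the Chilkat version, it is worth loading textUtf8Ado into a CkData and checking the bytes with GetHex before trusting the result.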