Strings in Perl: UTF-8 and Encoding Considerations
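In Perl, a string is a sequence of characters (code points) that may hold either decoded text or raw bytes. Perl does not record which, so the distinction is up to the program. Perl provides powerful Unicode (UTF-8) support, but it requires explicit handling: input is decoded and output is encoded. The encoding of a Perl source file also matters for string literals, because it determines how non-ASCII characters in the source are interpreted.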
Are Strings in Perl UTF-8?
Perl strings can hold any Unicode character, but Perl does not treat data as UTF-8 automatically. Input has to be decoded and output has to be encoded explicitly.
Perl distinguishes between "bytes" and "characters", which affects how strings are stored and processed.
A string literal is not read as UTF-8 unless the source declares it with "use utf8", and external data is not read as UTF-8 unless it is decoded.
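The bytes-versus-characters distinction can be seen with the core Encode module. The following is a minimal sketch, not a definitive recipe. The variable names are illustrative, and the input bytes are written with hex escapes so the example does not depend on how the file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Five raw bytes: "C", "a", "f", then the two-byte UTF-8 sequence for U+00E9.
my $bytes = "Caf\xC3\xA9";

# Decoding turns the byte string into a four-character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Encoding goes the other way and reproduces the original bytes.
my $round_trip = encode('UTF-8', $chars);
print "round trip matches\n" if $round_trip eq $bytes;
```

Run as written, this should report 5 bytes and 4 characters: "length" counts whatever units the string currently holds.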
UTF-8 vs. Raw Bytes
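    use strict;
    use warnings;
    use utf8;   # Interprets string literals in the source code as UTF-8

    my $str = "Café";
    print length($str), "\n";   # Output: 4 (four characters, because the literal was decoded)

- With "use utf8", Perl decodes the literal, so "$str" holds four characters and "length($str)" returns "4".
- Without "use utf8" (and with the file saved as UTF-8), "$str" would hold raw bytes. The "é" character is 2 bytes in UTF-8, so "length($str)" would return "5".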
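Because the "utf8" pragma is lexically scoped, its effect on "length" can be compared within a single file. This sketch assumes the file is saved as UTF-8. The variable names are illustrative only.

```perl
use strict;
use warnings;

# With the pragma in scope, the literal is decoded from the UTF-8 source bytes.
my $decoded = do { use utf8; "Café" };

# Without it, the literal keeps the raw bytes exactly as stored in the file.
my $raw = do { no utf8; "Café" };

printf "with use utf8:    %d\n", length($decoded);
printf "without use utf8: %d\n", length($raw);
```

Under the stated assumption, the first count is 4 (characters) and the second is 5 (bytes).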
How Source File Encoding Affects String Literals
Whether a Perl source file is saved as UTF-8 or as ANSI (Windows-1252) changes how its string literals are interpreted.
Case 1: Source File Saved as UTF-8 (Recommended)
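    use utf8;   # Tells Perl the source file is encoded as UTF-8
    binmode(STDOUT, ':encoding(UTF-8)');   # Encodes printed text as UTF-8

    my $str = "Café";
    print $str;

Perl decodes the literal "Café" into a four-character string, and the output layer on STDOUT encodes it back to UTF-8 when it is printed.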
Case 2: Source File Saved as ANSI (Windows-1252)
    # No utf8 pragma
    my $str = "Café";
    print $str;
If saved as ANSI (Windows-1252):
- The "é" character is stored as "\xE9" instead of the UTF-8 sequence "\xC3\xA9".
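- If Perl or the terminal expects UTF-8, for example because "use utf8" is in effect or the output is displayed as UTF-8, the lone "\xE9" byte is not a valid UTF-8 sequence. This can lead to malformed-character errors or garbled output.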
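One way to handle Windows-1252 data is to decode it from its real encoding and then encode it as UTF-8. The sketch below assumes the bytes are known to be Windows-1252 and uses the "cp1252" name from the core Encode module. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "Café" as stored in a Windows-1252 file: "é" is the single byte 0xE9.
my $cp1252_bytes = "Caf\xE9";

# Decode from the encoding the bytes are actually in.
my $chars = decode('cp1252', $cp1252_bytes);

# Encode to UTF-8, which turns U+00E9 into the two-byte sequence C3 A9.
my $utf8_bytes = encode('UTF-8', $chars);

# Show the resulting bytes in hex.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;
print "$hex\n";   # prints: 43 61 66 C3 A9
```

The same decode-then-encode pattern applies to any legacy encoding that Encode supports. Only the encoding name passed to "decode" changes.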
How to Ensure Proper UTF-8 Handling
Tell Perl to interpret source code as UTF-8:
use utf8;