Strings in Perl: UTF-8 and Encoding Considerations

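In Perl, a string is a sequence of characters, but until data is decoded each character simply holds one byte value, so the same string type is used for raw bytes and for text. Perl provides powerful Unicode (UTF-8) support, but it requires explicit handling: input must be decoded and output encoded. The encoding of a Perl source file also matters for string literals, because it determines how non-ASCII characters in them are interpreted.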

Are Strings in Perl UTF-8?

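Perl strings can hold any Unicode character, but Perl does not treat the bytes it reads (from source code, files, or the terminal) as UTF-8 unless told to.

Perl distinguishes between "bytes" and "characters": functions such as "length" and "substr" count bytes on undecoded data and characters on decoded data.

A string becomes a character string only when its literal is declared as UTF-8 (with "use utf8") or its bytes are decoded explicitly.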
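
To make the distinction concrete, here is a minimal sketch using the core Encode module. The variable names are illustrative, and the bytes are written with "\x" escapes so the result does not depend on how the file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 encoding of "Café" as raw bytes ("é" is \xC3\xA9).
my $bytes = "Caf\xC3\xA9";

# decode() converts the bytes into a string of characters.
my $chars = decode('UTF-8', $bytes);

print length($bytes), "\n";  # 5: counts bytes
print length($chars), "\n";  # 4: counts characters

# encode() goes the other way, back to the same 5 bytes.
my $round_trip = encode('UTF-8', $chars);
print $round_trip eq $bytes ? "same bytes\n" : "different bytes\n";
```

The same "length" call gives two different answers because the two strings hold different things: one holds encoded bytes, the other holds decoded characters.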

UTF-8 vs. Raw Bytes

use strict;
use warnings;
use utf8;  # Interprets literal source code strings as UTF-8
my $str = "Café";
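print length($str), "\n";  # Output: 4 (four characters; "é" counts as one)

With "use utf8", Perl decodes the literal, so "$str" holds 4 characters and "length($str)" returns "4".
Without "use utf8", the "é" would remain the 2 bytes of its UTF-8 encoding, and "length($str)" would return "5".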
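
The effect of the pragma can be checked directly, since "use utf8" is lexically scoped. The following sketch assumes the script itself is saved as UTF-8; the sub names are made up for illustration.

```perl
use strict;
use warnings;

# Literal is decoded as UTF-8: the string holds characters.
sub length_with_pragma {
    use utf8;
    return length "Café";
}

# Literal is left as the raw bytes found in the source file.
sub length_without_pragma {
    no utf8;
    return length "Café";
}

print length_with_pragma(), "\n";     # 4 if the file is saved as UTF-8
print length_without_pragma(), "\n";  # 5 if the file is saved as UTF-8
```

If the file were saved as Windows-1252 instead, the second sub would see a single byte for "é", and the first would be handed input that is not valid UTF-8.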

How Source File Encoding Affects String Literals

Yes. Whether a Perl source file is saved as UTF-8 or as ANSI (Windows-1252) determines which bytes a non-ASCII string literal contains, and therefore how Perl interprets it.

Case 1: Source File Saved as UTF-8 (Recommended)

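use utf8;  # Tells Perl the source file is encoded as UTF-8
binmode(STDOUT, ':encoding(UTF-8)');  # Encode characters as UTF-8 on output
my $str = "Café";
print $str;

Perl decodes "Café" into a 4-character string, and the output layer encodes it back to UTF-8 when printing. Without the "binmode" line, "é" would be written as the single byte "\xE9", which a UTF-8 terminal cannot display correctly.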

Case 2: Source File Saved as ANSI (Windows-1252)

# No utf8 pragma
my $str = "Café";
print $str;

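If saved as ANSI (Windows-1252):

The "é" character is stored as the single byte "\xE9" instead of the UTF-8 sequence "\xC3\xA9".
A lone "\xE9" is not valid UTF-8, so if anything expects UTF-8 (a UTF-8 terminal, or Perl itself under "use utf8"), this may cause garbled output or malformed-character errors.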
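
When Windows-1252 data cannot be avoided, one option is to decode it explicitly and re-encode it as UTF-8. This is a sketch using the core Encode module; the sub name "cp1252_to_utf8" is hypothetical, and the input byte is written as an escape so the example behaves the same however the file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert a Windows-1252 byte string to UTF-8 bytes.
sub cp1252_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('cp1252', $bytes);  # bytes -> characters
    return encode('UTF-8', $chars);        # characters -> UTF-8 bytes
}

my $ansi = "Caf\xE9";              # "Café" as stored in a Windows-1252 file
my $utf8 = cp1252_to_utf8($ansi);

# Show each byte in hex: the single E9 becomes the pair C3 A9.
print join(' ', map { sprintf '%02X', ord } split //, $ansi), "\n";  # 43 61 66 E9
print join(' ', map { sprintf '%02X', ord } split //, $utf8), "\n";  # 43 61 66 C3 A9
```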

How to Ensure Proper UTF-8 Handling

Tell Perl to interpret source code as UTF-8:

use utf8;