Strings in PHP: Encoding, UTF-8, and Source File Format Considerations

In PHP, a string is a sequence of bytes, not a sequence of characters. PHP neither enforces nor assumes UTF-8 or any other encoding: a string is raw data, and how it is interpreted depends on the functions you apply to it.


Are Strings in PHP UTF-8?

PHP strings are binary-safe: they can hold UTF-8, ASCII, or bytes in any other encoding, and PHP attaches no encoding information to them. Whether a given string is UTF-8 depends entirely on where its bytes came from.

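Unlike Python 3, whose "str" type is a sequence of Unicode code points, PHP has no native Unicode string type. Encoding-aware behavior comes only from functions that take or assume an encoding, such as those in the mbstring extension.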
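
The byte-oriented behavior described above can be seen directly. This is a minimal sketch, assuming the mbstring extension is loaded; the variable names are illustrative, and the strings are built from byte escapes so the result does not depend on how the file is saved.

```php
<?php
// Explicit byte escapes keep this example independent of the file's encoding.
$binary = "a\0b";        // a NUL byte is just another byte to PHP
$eAcute = "\xC3\xA9";    // the two bytes that encode "é" in UTF-8

echo strlen($binary), "\n";    // 3
echo bin2hex($eAcute), "\n";   // c3a9
echo strlen($eAcute), "\n";    // 2

// Indexing works on byte offsets, so this picks up half of a character.
$firstByte = $eAcute[0];
var_dump(mb_check_encoding($firstByte, 'UTF-8'));  // bool(false)
```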

Example: Bytes vs. Characters

$str_utf8 = "Hello 😊";  // UTF-8 bytes, provided the source file is saved as UTF-8
echo strlen($str_utf8);  // Outputs 10 (6 single-byte characters + 4 bytes for the emoji)
  • "strlen()" returns the number of bytes, not characters.
  • A UTF-8 character occupies one to four bytes, so "strlen()" of "😊" alone is 4, not 1.

To handle UTF-8 properly, use "mb_strlen()" from the mbstring extension:

echo mb_strlen($str_utf8, "UTF-8");  // Outputs 7 (characters, counted as code points)
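
The same byte-versus-character distinction applies when slicing a string. The following is a sketch only, assuming PHP 7 or later (for the "\u{...}" escape) and the mbstring extension; "substr()" and "mb_substr()" are used here as an illustration of the general pattern and are not taken from the example above.

```php
<?php
$str = "Hello \u{1F60A}";   // PHP 7+ escape; always produces UTF-8 bytes

$bytes = strlen($str);               // 10: 6 ASCII bytes + 4 for the emoji
$chars = mb_strlen($str, 'UTF-8');   // 7 code points

// A byte-oriented cut can end in the middle of a character...
$cutByBytes = substr($str, 0, 7);
// ...while the mb_* counterpart cuts on character boundaries.
$cutByChars = mb_substr($str, 0, 7, 'UTF-8');

var_dump(mb_check_encoding($cutByBytes, 'UTF-8'));  // bool(false)
var_dump($cutByChars === $str);                      // bool(true)
```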

Does PHP Source File Encoding Matter?

Yes. PHP copies the bytes of a string literal straight from the source file, so the file's encoding determines which bytes the string contains.

UTF-8 vs. ANSI (Windows-1252, ISO-8859-1, etc.)

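  • If the PHP source file is saved in a legacy single-byte encoding (often labeled "ANSI" on Windows; typically Windows-1252 or ISO-8859-1), non-ASCII characters in string literals are stored as single bytes that are not valid UTF-8, and they display incorrectly wherever UTF-8 is expected.
  • If saved as UTF-8 (without BOM), string literals contain UTF-8 bytes, which matches PHP's default charset (UTF-8 since PHP 5.6).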

Example of Encoding Issues

Case 1: Source File Saved as UTF-8 (Recommended)
$str = "Café";
echo $str;  // Correctly outputs: Café
  • This works because the literal's bytes (é = 0xC3 0xA9) are valid UTF-8, matching the charset PHP declares to the browser by default.

Case 2: Source File Saved as ANSI (Windows-1252)
$str = "Café";
echo $str;
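  • The output displays incorrectly wherever UTF-8 is expected, for example in a browser that receives PHP's default "Content-Type: text/html; charset=UTF-8" header.
  • PHP does not interpret or convert the bytes; it emits them exactly as they appear in the file.
  • In Windows-1252, é is the single byte 0xE9. On its own, that byte is not a valid UTF-8 sequence, so a UTF-8 consumer typically shows the replacement character (U+FFFD) in its place.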
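
Case 2 can be reproduced without changing any file's encoding by writing the Windows-1252 byte as an escape. This is a minimal sketch, assuming the mbstring extension is available; it also shows one possible repair, re-encoding with "mb_convert_encoding()", which only works if you already know the source encoding.

```php
<?php
// "Caf\xE9" holds the same bytes a Windows-1252 file would contain for "Café".
$legacy = "Caf\xE9";

// A lone 0xE9 is not a valid UTF-8 sequence.
var_dump(mb_check_encoding($legacy, 'UTF-8'));  // bool(false)

// Re-encode, naming the source encoding explicitly (assumed known here).
$utf8 = mb_convert_encoding($legacy, 'UTF-8', 'Windows-1252');

echo bin2hex($utf8), "\n";                      // 436166c3a9
var_dump(mb_check_encoding($utf8, 'UTF-8'));    // bool(true)
```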

How to Ensure Correct Encoding

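Save PHP files as "UTF-8 without BOM" in your text editor. A BOM placed before the opening "<?php" tag is sent as output, which can corrupt the page and trigger "headers already sent" warnings.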
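
To check whether an existing file was saved with a BOM, you can compare its first three bytes with the UTF-8 BOM (EF BB BF). The helper below, "hasUtf8Bom()", is a hypothetical function written for this illustration, not part of PHP; the temporary file stands in for a real source file.

```php
<?php
// Hypothetical helper: true if the file starts with the UTF-8 BOM (EF BB BF).
function hasUtf8Bom(string $path): bool
{
    // Read only the first three bytes of the file.
    $head = file_get_contents($path, false, null, 0, 3);
    return $head === "\xEF\xBB\xBF";
}

// Demo with a temporary file that begins with a BOM.
$tmp = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($tmp, "\xEF\xBB\xBFhello");
$found = hasUtf8Bom($tmp);
var_dump($found);  // bool(true)
unlink($tmp);
```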