Strings in PHP: Encoding, UTF-8, and Source File Format Considerations
In PHP, strings are sequences of bytes, not necessarily characters. PHP therefore does not enforce UTF-8 or any other encoding; a string is raw data, and its meaning depends on the functions you use to process it.
Are Strings in PHP UTF-8?
PHP strings are binary-safe, meaning they can store UTF-8, ASCII, or any other encoding, but PHP itself does not assume UTF-8 unless explicitly handled.
Unlike Python 3, whose str type is a sequence of Unicode code points, PHP has no character-level string type: core functions such as strlen() and substr() operate on bytes.
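The byte-oriented behavior can be seen in a minimal sketch. The variable names are illustrative, and the bytes are written as escape sequences so the result does not depend on how the file itself is saved.

```php
<?php
// "é" written as explicit UTF-8 bytes (0xC3 0xA9).
$e_acute = "\xC3\xA9";

echo strlen($e_acute), "\n";      // 2: two bytes, one character
echo bin2hex($e_acute), "\n";     // c3a9
echo bin2hex($e_acute[0]), "\n";  // c3: indexing yields one byte, not one character

// Binary safety: an embedded NUL byte is stored and counted like any other byte.
$with_nul = "a\0b";
echo strlen($with_nul), "\n";     // 3
```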
Example: UTF-8 vs. Non-UTF-8
$str_utf8 = "Hello 😊"; // UTF-8 text, provided the source file is saved as UTF-8
echo strlen($str_utf8); // Outputs 10 (bytes): 6 for "Hello " plus 4 for the emoji
- "strlen()" returns the number of bytes, not characters.
- UTF-8 characters may take multiple bytes, so "strlen("😊")" is not 1 (it's 4 bytes in UTF-8).
To handle UTF-8 properly, use "mb_strlen()" from the mbstring extension:
echo mb_strlen($str_utf8, "UTF-8"); // Outputs 7 (characters) when $str_utf8 holds valid UTF-8
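The same byte-versus-character distinction applies to slicing. The following is a sketch, assuming PHP 7.0 or later (for the \u{...} escape) and the mbstring extension; the variable names are illustrative.

```php
<?php
// \u{...} always emits the UTF-8 bytes for the code point,
// regardless of how this file is saved.
$str = "Hello \u{1F60A}";

$bytes = strlen($str);              // 10: 6 ASCII bytes + 4 bytes for the emoji
$chars = mb_strlen($str, "UTF-8");  // 7: 6 ASCII characters + 1 emoji

echo $bytes, " bytes, ", $chars, " characters\n";

// substr() cuts by byte and can split a character; mb_substr() cuts by character.
$cut_bytes = substr($str, 0, 7);              // "Hello " plus 1 of the emoji's 4 bytes
$cut_chars = mb_substr($str, 0, 7, "UTF-8");  // the whole string

var_dump(mb_check_encoding($cut_bytes, "UTF-8")); // bool(false): truncated sequence
var_dump(mb_check_encoding($cut_chars, "UTF-8")); // bool(true)
```

Passing the encoding explicitly, as above, avoids depending on the configured internal encoding.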
Does PHP Source File Encoding Matter?
Yes, it does matter how a PHP source file is encoded, especially when working with string literals.
UTF-8 vs. ANSI (Windows-1252, ISO-8859-1, etc.)
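- If the PHP source file is saved as ANSI (Windows-1252, ISO-8859-1, etc.), each non-ASCII character in a string literal is stored as a single legacy-encoded byte, which displays incorrectly when the output is interpreted as UTF-8.
- If saved as UTF-8 (without BOM), string literals contain UTF-8 bytes, which is what browsers and the mbstring functions typically expect.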
Example of Encoding Issues
Case 1: Source File Saved as UTF-8 (Recommended)
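$str = "Café";
echo $str; // Correctly outputs: Café
- This works as expected because the literal holds the UTF-8 bytes 0xC3 0xA9 for é, and the output is labeled UTF-8 by default (default_charset), so the browser decodes it correctly.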
Case 2: Source File Saved as ANSI (Windows-1252)
$str = "Café"; echo $str;
- The output may display incorrectly if PHP expects UTF-8 but the file is ANSI.
- PHP treats the raw bytes as UTF-8 (by default in modern environments).
- The character é (é in Windows-1252 = 0xE9) gets misinterpreted if read as UTF-8.
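The mismatch between the two cases can be reproduced without changing the file's encoding by writing the bytes as escapes. This is a sketch assuming the mbstring extension is available; the variable names are illustrative.

```php
<?php
$cp1252 = "Caf\xE9";       // "Café" as stored in a Windows-1252 file
$utf8   = "Caf\xC3\xA9";   // "Café" as stored in a UTF-8 file

var_dump(mb_check_encoding($cp1252, "UTF-8")); // bool(false): 0xE9 alone is not valid UTF-8
var_dump(mb_check_encoding($utf8, "UTF-8"));   // bool(true)

// Re-encode the Windows-1252 bytes as UTF-8.
$converted = mb_convert_encoding($cp1252, "UTF-8", "Windows-1252");
var_dump($converted === $utf8);                // bool(true)
```

Converting at the boundary like this is one way to cope with legacy data; it only works when the source encoding is actually known, since the bytes alone do not identify it.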
How to Ensure Correct Encoding
Save PHP files as "UTF-8 without BOM" in text editors.
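The BOM matters because any bytes before the opening PHP tag are sent as output, which can interfere with headers. A minimal sketch of a check for it follows; has_utf8_bom() is a hypothetical helper written for this example, not a PHP built-in.

```php
<?php
// Hypothetical helper: reports whether a byte string starts with
// the UTF-8 byte order mark (EF BB BF).
function has_utf8_bom(string $bytes): bool
{
    return strncmp($bytes, "\xEF\xBB\xBF", 3) === 0;
}

var_dump(has_utf8_bom("\xEF\xBB\xBF<?php echo 'hi';")); // bool(true)
var_dump(has_utf8_bom("<?php echo 'hi';"));             // bool(false)
```

In practice the argument would be the contents of a source file, for example as read with file_get_contents().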