Strings in Python: UTF-8 and Source File Encoding
In Python, strings are textual data represented as sequences of Unicode code points. Since Python 3, every string (the "str" type) is Unicode, and UTF-8 is the default encoding for source code files.
Are Strings in Python UTF-8?
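All "str" objects in Python 3 are Unicode, but they are not UTF-8 as such: a "str" is a sequence of Unicode code points, independent of any particular encoding. UTF-8 comes in when text is converted to or from bytes. It is the encoding Python assumes for source files by default, and the default for "encode()" and "decode()". "bytes" objects, by contrast, are raw byte sequences with no encoding attached; they only become text once they are decoded with a specific encoding.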
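As a small sketch of why "bytes" carry no encoding of their own, the snippet below decodes the same two bytes under two different encodings. The variable names are illustrative only.

```python
raw = b"\xc3\xa9"  # two raw bytes; nothing here says which encoding they use

as_utf8 = raw.decode("utf-8")      # read as UTF-8: the single character "é"
as_latin1 = raw.decode("latin-1")  # read as ISO-8859-1: the two characters "Ã©"

print(len(raw), len(as_utf8), len(as_latin1))  # 2 1 2
print(as_utf8 == "é", as_latin1 == "Ã©")       # True True
print(raw[0])                                  # 195: indexing bytes yields integers
```

The bytes are identical in both cases; only the encoding chosen at decode time determines which text they become.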
Example: Unicode Strings in Python
    s = "Café 😊"    # Unicode string
    print(type(s))  # Output: <class 'str'>
    print(len(s))   # Output: 6 (Counts characters, not bytes)
- Python 3 strings ("str") store Unicode characters, not bytes.
- Unlike Python 2, where "str" was a byte string and source files were assumed to be ASCII by default, Python 3 makes all "str" objects Unicode.
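To make the difference between characters and bytes concrete, here is a minimal sketch that compares the length of the same string with the size of its encoded forms. It assumes the "é" is stored as the single precomposed code point U+00E9.

```python
s = "Café 😊"

print(len(s))                      # 6 code points
print(len(s.encode("utf-8")))      # 10 bytes
print(len(s.encode("utf-16-le")))  # 14 bytes
print(len(s.encode("utf-32-le")))  # 24 bytes

# UTF-8 is variable-width: each code point takes between 1 and 4 bytes.
widths = [len(ch.encode("utf-8")) for ch in s]
for ch, width in zip(s, widths):
    print(f"U+{ord(ch):04X} -> {width} byte(s)")
```

The ASCII letters take one byte each in UTF-8, "é" takes two, and the emoji takes four, which is why the byte count differs from "len(s)".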
How Source File Encoding Affects String Literals
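The source file encoding matters whenever the file contains non-ASCII characters, for example in string literals. Python has to decode the file's bytes into text before it can parse them, so the encoding it assumes must match the one the file was saved in.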
Case 1: Source File Saved as UTF-8 (Recommended)
    s = "Café 😊"
    print(s)
This works correctly, since Python 3 assumes UTF-8 encoding for source files.
Case 2: Source File Saved as ANSI (Windows-1252, ISO-8859-1)
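If the source file is saved in a legacy encoding such as Windows-1252 or ISO-8859-1 (often labelled "ANSI" on Windows) and carries no encoding declaration, Python still tries to decode it as UTF-8. In those encodings "é" is the single byte 0xE9, which is not valid UTF-8 on its own, so running the script typically fails with an error such as:
    SyntaxError: Non-UTF-8 code starting with '\xe9' in file script.py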
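The failure can be reproduced without touching an editor. The sketch below writes the same source to a temporary file as Windows-1252 bytes, once without and once with an encoding declaration, and runs it with the current interpreter. The helper "run_source" is a hypothetical name introduced for this illustration, and the exact wording of the error message may vary between Python versions.

```python
import os
import subprocess
import sys
import tempfile


def run_source(source_bytes):
    """Write raw bytes to a temporary script.py and run it with this interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "script.py")
        with open(path, "wb") as f:
            f.write(source_bytes)
        return subprocess.run([sys.executable, path], capture_output=True)


# ascii() keeps the child's output ASCII-only, so the console encoding is irrelevant.
body = 's = "Café"\nprint(ascii(s))\n'

# Saved as Windows-1252 with no declaration: "é" becomes the lone byte 0xE9.
undeclared = run_source(body.encode("cp1252"))
print(undeclared.returncode != 0)            # True
print(b"SyntaxError" in undeclared.stderr)   # True

# Same encoding, but the first line declares it, so Python decodes the file correctly.
declared = run_source(("# -*- coding: cp1252 -*-\n" + body).encode("cp1252"))
print(declared.stdout.decode("ascii").strip())  # 'Caf\xe9'
```

Declaring the legacy encoding makes the file load, but re-saving it as UTF-8 is usually the simpler long-term fix.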
How to Ensure Proper UTF-8 Handling
Always save Python files as UTF-8 in editors like VS Code, PyCharm, or Notepad++.
Declare the encoding explicitly if needed (Python 2 requires this for non-ASCII source, and Python 3 needs it only when the file is not saved as UTF-8):
    # -*- coding: utf-8 -*-
    s = "Café 😊"
    print(s)
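To check which encoding a declaration actually selects, the standard library's "tokenize.detect_encoding" can be applied to the raw bytes of a file. The following is a minimal sketch; "declared_encoding" is a hypothetical helper name, and the sample sources are made up for illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the source encoding inferred from a BOM or coding comment."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


print(declared_encoding(b"print('hi')\n"))                            # utf-8
print(declared_encoding(b"# -*- coding: utf-8 -*-\nprint('hi')\n"))   # utf-8
print(declared_encoding(b"# -*- coding: cp1252 -*-\nprint('hi')\n"))  # cp1252
print(declared_encoding(b"# coding: latin-1\nprint('hi')\n"))         # iso-8859-1
```

The first two calls return the same result, which illustrates why the UTF-8 declaration is optional in Python 3: UTF-8 is already the default when no declaration is present.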
Use "encode()" and "decode()" to convert between "str" and "bytes":
    s = "Café"
    b = s.encode("utf-8")   # Convert to bytes
    print(b)                # Output: b'Caf\xc3\xa9'
    s2 = b.decode("utf-8")  # Convert back to string
    print(s2)               # Output: Café
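Conversions can fail when the chosen codec cannot represent a character or when bytes are decoded with the wrong codec. As a rough sketch, the example below shows the default behaviour (an exception) alongside a few of the standard "errors=" handlers.

```python
s = "Café"

# Encoding to a codec that cannot represent "é" raises by default.
try:
    s.encode("ascii")
except UnicodeEncodeError as exc:
    print(type(exc).__name__)  # UnicodeEncodeError

# The errors= argument selects a fallback instead of raising.
print(s.encode("ascii", errors="replace"))           # b'Caf?'
print(s.encode("ascii", errors="backslashreplace"))  # b'Caf\\xe9'
print(s.encode("ascii", errors="ignore"))            # b'Caf'

# Decoding with the wrong codec fails too, here on a Windows-1252 byte.
legacy = s.encode("cp1252")                          # b'Caf\xe9'
try:
    legacy.decode("utf-8")
except UnicodeDecodeError as exc:
    print(type(exc).__name__)  # UnicodeDecodeError

# With errors="replace", the undecodable byte becomes U+FFFD.
repaired = legacy.decode("utf-8", errors="replace")
print(ascii(repaired))                               # 'Caf\ufffd'
```

The fallback handlers lose information, so they suit logging or display better than data that must round-trip exactly.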
Handle file I/O correctly by passing an explicit encoding, since the default otherwise depends on the platform and locale:
    with open("file.txt", "r", encoding="utf-8") as f:
        data = f.read()
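A round trip through a temporary file shows what the explicit encoding does. This is only a sketch: the file name and sample text are arbitrary, and "newline=''" is passed so that line endings are not translated and the byte count is the same on every platform.

```python
import os
import tempfile

text = "Café 😊\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")

    # Write and read back with the same explicit encoding.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8", newline="") as f:
        data = f.read()

    # The raw bytes on disk, with no decoding applied.
    with open(path, "rb") as f:
        raw = f.read()

    # Reading the same file with a mismatched encoding yields mojibake.
    with open(path, "r", encoding="latin-1", newline="") as f:
        garbled = f.read()

print(data == text)                     # True
print(len(data), len(raw))              # 7 11
print(len(garbled), garbled == text)    # 11 False
```

Reading with "latin-1" does not raise, because every byte maps to some character in that encoding, so a mismatch of this kind can go unnoticed until the text is displayed.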