Strings in Python: UTF-8 and Source File Encoding

In Python, strings are textual data represented as sequences of Unicode characters. Since Python 3, all strings ("str" type) are Unicode by default, and UTF-8 is the standard encoding for source code files.

Are Strings in Python UTF-8?

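Not quite. Every "str" object in Python 3 is a sequence of Unicode code points, but Unicode is a character set, not a byte encoding. UTF-8 comes into play only when text is converted to or from bytes, for example when Python reads a source file, which it expects to be UTF-8 encoded by default. "bytes" objects are raw byte sequences with no inherent encoding; they may or may not hold UTF-8 data.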

Example: Unicode Strings in Python

s = "Café 😊"  # Unicode string
print(type(s))  # Output: <class 'str'>
print(len(s))   # Output: 6 (counts code points, not bytes)

Python 3 strings ("str") store Unicode characters, not bytes.
Unlike Python 2, where "str" was a byte string and Unicode text required the separate "unicode" type, Python 3 makes every "str" Unicode.
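
To make the difference between code points and bytes concrete, the following minimal sketch encodes the same string with several codecs and compares the lengths. The variable names are illustrative, and the string is assumed to use the precomposed form of "é" (U+00E9).

```python
# One string, one length in code points, but a different length in bytes
# for each encoding. Escapes are used so the example does not depend on
# how this file itself is saved.
s = "Caf\u00e9 \U0001F60A"  # "Café 😊", with é as precomposed U+00E9 (assumption)

code_points = len(s)                     # 6
utf8_len = len(s.encode("utf-8"))        # 10: é takes 2 bytes, the emoji takes 4
utf16_len = len(s.encode("utf-16-le"))   # 14: the emoji needs a surrogate pair
utf32_len = len(s.encode("utf-32-le"))   # 24: always 4 bytes per code point

print(code_points, utf8_len, utf16_len, utf32_len)
```

The "-le" codec variants are used here only so that no byte order mark is added to the counts.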

How Source File Encoding Affects String Literals

The source file encoding matters whenever the file contains non-ASCII characters, for example in string literals.

Case 1: Source File Saved as UTF-8 (Recommended)

s = "Café 😊"
print(s)

This works correctly, since Python 3 decodes source files as UTF-8 by default.

Case 2: Source File Saved as ANSI (Windows-1252, ISO-8859-1)

If the source file is saved in a legacy 8-bit encoding (what Windows editors often label "ANSI", such as Windows-1252 or ISO-8859-1) and has no encoding declaration, Python cannot decode its non-ASCII bytes as UTF-8.

Running the script typically results in:

SyntaxError: Non-UTF-8 code starting with '\xe9' in file script.py

Python expects UTF-8 by default, so encoding mismatches cause errors.
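
The same mismatch can be reproduced on ordinary byte strings, without touching a source file. The sketch below is only an illustration, assuming Windows-1252 ("cp1252") as the legacy encoding; it shows both directions of the mistake: legacy bytes that are not valid UTF-8, and UTF-8 bytes misread as Windows-1252.

```python
text = "Café"

# Direction 1: bytes written as Windows-1252, read as UTF-8.
legacy = text.encode("cp1252")  # b'Caf\xe9'
try:
    legacy.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print("cannot decode as UTF-8:", exc.reason)

# Direction 2: bytes written as UTF-8, read as Windows-1252.
# No exception is raised here; the text is silently garbled instead.
mojibake = text.encode("utf-8").decode("cp1252")
print(mojibake)  # CafÃ©
```

The second direction is often the more troublesome one in practice, because nothing fails and the corrupted text only shows up later.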

How to Ensure Proper UTF-8 Handling

Always save Python files as UTF-8 in editors like VS Code, PyCharm, or Notepad++.

Declare the encoding explicitly if needed. Python 2 requires the declaration for any non-ASCII source, while Python 3 needs it only when the file is not saved as UTF-8:

# -*- coding: utf-8 -*-
s = "Café 😊"
print(s)
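
To check which encoding Python would pick for a given source file, the standard library's tokenize.detect_encoding() can be used. The following is a small sketch; the helper name declared_encoding and the sample byte strings are hypothetical and exist only for illustration.

```python
import io
import tokenize


def declared_encoding(source: bytes) -> str:
    """Return the encoding detected for Python source given as bytes."""
    # detect_encoding() reads at most two lines, looking for a BOM or a
    # "coding" declaration, and falls back to UTF-8 when it finds neither.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# Hypothetical sources: one declares Windows-1252, one declares nothing.
with_declaration = b"# -*- coding: cp1252 -*-\ns = 'Caf\xe9'\n"
without_declaration = b"s = 'Cafe'\n"

print(declared_encoding(with_declaration))     # cp1252
print(declared_encoding(without_declaration))  # utf-8
```

With the declaration present, the byte 0xE9 in the first sample is read as "é" rather than rejected as invalid UTF-8.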

Use "encode()" and "decode()" to convert between "str" and "bytes":

s = "Café"
b = s.encode("utf-8")  # Convert to bytes
print(b)  # Output: b'Caf\xc3\xa9'
s2 = b.decode("utf-8")  # Convert back to string
print(s2)  # Output: Café
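
Both methods also accept an "errors" argument that controls what happens when a character or byte cannot be converted. The sketch below shows a few of the standard handlers; the choice of ASCII as the target encoding is just an assumption to force a failure.

```python
s = "Café"

# The default handler is "strict", which raises on unencodable characters.
strict_failed = False
try:
    s.encode("ascii")
except UnicodeEncodeError:
    strict_failed = True

replaced = s.encode("ascii", errors="replace")          # b'Caf?'
ignored = s.encode("ascii", errors="ignore")            # b'Caf'
escaped = s.encode("ascii", errors="backslashreplace")  # b'Caf\\xe9'

# When decoding, "replace" substitutes U+FFFD for each invalid sequence.
recovered = b"Caf\xe9!".decode("utf-8", errors="replace")  # 'Caf\ufffd!'

print(strict_failed, replaced, ignored, escaped, recovered)
```

Lossy handlers such as "replace" and "ignore" discard information, so they are usually better suited to logging or display than to data that must round-trip.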

Handle file I/O correctly:

with open("file.txt", "r", encoding="utf-8") as f:
    data = f.read()  # decoded to str using UTF-8
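
A complete round trip can be sketched as follows. This is an illustrative example that writes to a temporary directory (an assumption, so that no real file is overwritten) and then reads the same file back both as text and as raw bytes.

```python
import os
import tempfile

text = "Café 😊"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.txt")

    # Text mode with an explicit encoding: str in, UTF-8 bytes on disk.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Text mode again: bytes on disk are decoded back to str.
    with open(path, "r", encoding="utf-8") as f:
        data = f.read()

    # Binary mode: no decoding happens, the result is a bytes object.
    with open(path, "rb") as f:
        raw = f.read()

print(data == text)          # True
print(type(raw), len(raw))   # <class 'bytes'> 10
```

Passing "encoding" explicitly avoids depending on the platform's default text encoding, which is not UTF-8 on every system.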