Strings in Python: UTF-8 and Source File Encoding

In Python 3, a string (the "str" type) is an immutable sequence of Unicode code points. UTF-8 is the default encoding for source code files.


Are Strings in Python UTF-8?

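Not quite. A "str" object in Python 3 holds Unicode code points, not UTF-8 bytes; how the interpreter stores them in memory is an implementation detail. UTF-8 matters in two places: Python source files are decoded as UTF-8 by default, and UTF-8 is the usual encoding when converting between "str" and "bytes". A "bytes" object is a raw byte sequence with no encoding attached to it.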

Example: Unicode Strings in Python

s = "Café 😊"  # Unicode string
print(type(s))  # Output: <class 'str'>
print(len(s))   # Output: 6 (counts code points, not bytes)

  • Python 3 strings ("str") store Unicode code points, not bytes.
  • In Python 2, "str" was a byte string and Unicode text needed the separate "unicode" type; in Python 3, every "str" is Unicode.
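
To make the difference between code points and bytes concrete, here is a minimal sketch that compares the length of the string with the length of its UTF-8 encoding. It assumes "é" is stored as the single precomposed code point U+00E9; the variable names are illustrative only.

```python
# "Café 😊", written with escapes so this snippet itself stays pure ASCII
s = "Caf\u00e9 \U0001F60A"

code_points = len(s)                   # number of Unicode code points
utf8_bytes = len(s.encode("utf-8"))    # number of bytes once encoded as UTF-8

print(code_points)  # 6
print(utf8_bytes)   # 10: "é" takes 2 bytes and the emoji takes 4
```

The two numbers differ because UTF-8 is a variable-width encoding: ASCII characters take one byte each, while other characters take two to four.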

How Source File Encoding Affects String Literals

The source file encoding matters whenever the file contains non-ASCII characters in string literals.

Case 1: Source File Saved as UTF-8 (Recommended)

s = "Café 😊"
print(s)

This works correctly, because Python 3 decodes source files as UTF-8 by default.

Case 2: Source File Saved as ANSI (Windows-1252, ISO-8859-1)

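If the source file is saved in a legacy single-byte encoding (often labelled "ANSI" on Windows, typically Windows-1252 or ISO-8859-1) and has no encoding declaration, Python usually cannot decode the non-ASCII bytes as UTF-8. These encodings cannot represent the emoji at all, so only text such as "Café" can be saved this way.

  • Running the script typically fails with an error such as:
    SyntaxError: Non-UTF-8 code starting with '\xe9' in file script.py

  • Python expects UTF-8 by default, so an encoding mismatch normally causes an error.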
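
The mismatch can be reproduced without saving a file. The sketch below is only an approximation of what the parser encounters: it encodes "Café" as Windows-1252 and then tries to decode the resulting bytes as UTF-8. The names "legacy_bytes" and "decoded_ok" are illustrative.

```python
text = "Caf\u00e9"  # "Café"

# The bytes an editor would write when saving as Windows-1252
legacy_bytes = text.encode("cp1252")
print(legacy_bytes)  # b'Caf\xe9'

# 0xE9 on its own is not a valid UTF-8 sequence, so decoding fails
try:
    legacy_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print("cannot decode as UTF-8:", exc.reason)
```

The byte 0xE9 is the same one named in the SyntaxError message above.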

How to Ensure Proper UTF-8 Handling

    Always save Python files as UTF-8 in editors like VS Code, PyCharm, or Notepad++.

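    Declare the encoding explicitly if needed, on the first or second line of the file. This is required for non-ASCII source in Python 2 (whose default was ASCII) and in Python 3 when the file is not saved as UTF-8: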

    # -*- coding: utf-8 -*-
    s = "Café 😊"
    print(s)
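
One way to check which encoding Python would pick for a piece of source is the standard library's "tokenize.detect_encoding". The sketch below is illustrative: "declared_encoding" is a hypothetical helper, and the two byte strings stand in for the contents of source files.

```python
import io
import tokenize


def declared_encoding(source: bytes) -> str:
    """Return the encoding Python would use to decode this source."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding


# A file saved as Windows-1252 that declares its encoding on line 1
with_declaration = b'# -*- coding: cp1252 -*-\ns = "Caf\xe9"\n'
# A pure-ASCII file with no declaration
without_declaration = b's = "Cafe"\n'

print(declared_encoding(with_declaration))     # cp1252
print(declared_encoding(without_declaration))  # utf-8
```

With no declaration and no byte order mark, the function falls back to UTF-8, matching the Python 3 default.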
    

    Use "encode()" and "decode()" to convert between "str" and "bytes":

    s = "Café"
    b = s.encode("utf-8")  # Convert to bytes
    print(b)  # Output: b'Caf\xc3\xa9'
    s2 = b.decode("utf-8")  # Convert back to string
    print(s2)  # Output: Café
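
The round trip only works when both sides agree on the encoding. As a hedged illustration, the sketch below decodes UTF-8 bytes with the wrong codec and with the "errors" parameter; the variable names are illustrative.

```python
s = "Caf\u00e9"        # "Café"
b = s.encode("utf-8")  # b'Caf\xc3\xa9'

# Latin-1 maps every byte to a character, so this succeeds but gives mojibake
wrong = b.decode("latin-1")
print(wrong)  # CafÃ©

# ASCII cannot decode bytes above 0x7F; errors="replace" substitutes U+FFFD
lossy = b.decode("ascii", errors="replace")
print(len(lossy))  # 5: "Caf" plus two replacement characters
```

Decoding with the wrong codec does not always raise an error, which is why garbled text like "CafÃ©" often goes unnoticed until it is displayed.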
    

    Pass the encoding explicitly for file I/O, because the default otherwise depends on the platform and locale:

    with open("file.txt", "r", encoding="utf-8") as f:
        data = f.read()  # "data" is a str decoded from UTF-8
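
A self-contained round trip might look like the following sketch. It writes to a temporary directory so it does not depend on an existing "file.txt"; the path and variable names are assumptions made for illustration.

```python
import os
import tempfile

text = "Caf\u00e9 \U0001F60A"  # "Café 😊"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "file.txt")

    # Write the text as UTF-8
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read it back as text, decoding from UTF-8
    with open(path, "r", encoding="utf-8") as f:
        data = f.read()

    # Read the same file in binary mode to see the raw bytes
    with open(path, "rb") as f:
        raw = f.read()

print(data == text)  # True
print(len(raw))      # 10 bytes on disk for 6 code points
```

Text mode returns "str" and binary mode returns "bytes", so the encoding argument applies only to text mode.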