Encoding and Unicode

What I Have Learned

  1. Computers only understand bytes.
  2. Unicode is a system for mapping characters to numbers that accommodates all the world’s languages. Each character in each language is assigned its own unique Unicode number (a.k.a. “code point”).
  3. An encoding, such as UTF-8 or UTF-16, is a system of translating Unicode code points to bytes so that they can be stored on a computer.
  4. Python 3 string objects are sequences of Unicode code points. They are not encoded, because by definition, a Python string object is not made up of bytes.
  5. Python string objects can be encoded as bytes objects, and bytes objects can be decoded to string objects.
  6. The encoding of a given text file, if not known, cannot be guessed with 100% accuracy: https://forum.sublimetext.com/t/how-does-sublime-detect-file-encodings/16194
  7. When no encoding is specified, Python opens text files using the locale’s preferred (default) encoding.
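
Points 2 through 5 can be sketched in a few lines of Python 3. This is only an illustration: the sample string and the variable names are assumptions, not part of the notes above.

```python
# Illustrative sketch; the sample text and variable names are assumptions.
text = "na\u00efve \u20ac"  # "naive" with a diaeresis on the i, a space, and the euro sign

# A str is a sequence of Unicode code points; ord() returns each one as an int.
code_points = [ord(ch) for ch in text]

# Encoding turns the str into bytes; the bytes depend on the encoding chosen.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# Decoding with the same encoding recovers the original str.
round_trip = utf8_bytes.decode("utf-8")

print(code_points)
print(utf8_bytes)
print(len(text), len(utf8_bytes), len(utf16_bytes))
print(round_trip == text)
```

The string has 7 code points but occupies 10 bytes in UTF-8 and 14 bytes in UTF-16, which is why the length of a string and the length of its encoded bytes should not be confused.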

Encodings

UTF-32 = 4 bytes per Unicode character.

UTF-16, UTF-8 = variable-length encodings.

UTF-8 uses “signal bits” (the leading bits of the first byte) to indicate how many bytes a character consists of.
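
The byte counts can be checked directly in Python. The sketch below is illustrative only: the four sample characters are assumptions, chosen so that they need 1, 2, 3, and 4 bytes in UTF-8.

```python
# Illustrative sketch; the sample characters are assumptions chosen to
# cover 1-, 2-, 3- and 4-byte UTF-8 sequences.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

rows = []
for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-le")  # "-le" variant: no byte order mark is prepended
    utf32 = ch.encode("utf-32-le")
    # First UTF-8 byte in binary, so the leading "signal bits" are visible.
    first_byte_bits = format(utf8[0], "08b")
    rows.append((len(utf8), len(utf16), len(utf32), first_byte_bits))
    print(f"U+{ord(ch):04X}: utf-8={len(utf8)} utf-16={len(utf16)} "
          f"utf-32={len(utf32)} first utf-8 byte={first_byte_bits}")
```

In the output, the first UTF-8 byte starts with ``0`` for a 1-byte character, ``110`` for 2 bytes, ``1110`` for 3 bytes, and ``11110`` for 4 bytes. UTF-32 is always 4 bytes, while UTF-16 uses 2 bytes for the first three samples and 4 bytes for the last.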

Default Encoding

How to determine the default encoding - example (the result varies by platform and locale; 'cp1252' is typical on Western European Windows)
::

    >>> import locale
    >>> locale.getpreferredencoding()
    'cp1252'
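
Because the default differs between machines, one common way to avoid surprises is to pass ``encoding`` explicitly to ``open()``. The sketch below is a minimal illustration: the temporary file, its name, and the sample text are assumptions.

```python
import locale
import os
import tempfile

# Illustrative sketch; the file name and sample text are assumptions.
text = "caf\u00e9"
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Pass encoding explicitly so the result does not depend on the locale default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    read_back = f.read()

# Binary mode returns the raw bytes on disk: the UTF-8 encoding of the text.
with open(path, "rb") as f:
    raw = f.read()

# Decoding those bytes with a different encoding silently yields the wrong text.
misread = raw.decode("cp1252")

print(locale.getpreferredencoding())  # platform dependent
print(read_back == text)
print(raw)
print(misread)
```

The last line prints garbled text rather than raising an error, which is what happens when a UTF-8 file is opened on a system whose default encoding is cp1252 and no encoding is given.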