A 5-minute crash course on Unicode and encoding

  • In the beginning of time there was ASCII, which represented 128 characters: the English alphabet, the digits 0-9, basic symbols and control characters.
  • Representation-wise, each ASCII character was stored in 7 bits, so every character fit in a byte with room to spare.
  • But hold on: a byte has one extra bit beyond the 7 needed for ASCII, so a lot of people thought it was a good idea to make use of those 128 spare slots. Different countries assigned different characters to those slots, which made it impossible for two such character sets to co-exist on the same computer. Interesting times.
  • Chaos was inevitable and kittens went missing => Unicode came to the rescue.
  • Unicode assigns every character and symbol a code point such as U+0041 (the number is hexadecimal). On Windows XP, Run > charmap gives a better look.
  • Awesome. But hold on: how do we store those code points in memory? Enter Unicode encodings. UTF-8, for instance, uses between one and four bytes per character (the original design allowed up to six, but it is now capped at four). And yes, the first 128 code points are each encoded as a single byte identical to ASCII. That’s why the lucky English language never had any issues representing its characters. The first sketch after this list shows the code points and the UTF-8 byte counts side by side.
  • But for the rest of the world, bytes decoded with a different encoding than the one they were written in lead to the lovely little characters we have all seen in web browsers, such as ? and �. The second sketch after this list reproduces the effect.
  • Other encodings include ASCII, ISO 8859-1 (Latin-1) and Windows-1252 (Western European).
  • Moral of the story: to make the life of the web browser or IDE easier (rather than leaving it to lucky guessing), declare the encoding wherever you can (even in emails):
    Content-Type: text/plain; charset="UTF-8"
    or whichever encoding you are aiming for. The last sketch after this list shows a minimal server doing exactly that.
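
First sketch: a minimal Python example (Python 3.8+ is assumed, for bytes.hex with a separator) that prints the code point, the Unicode name and the UTF-8 bytes of a few characters. The sample string is just an illustrative choice: 'A' stays a single ASCII-compatible byte, while '€' and '𝄞' need three and four bytes.

    import unicodedata

    for ch in "Aé€𝄞":
        utf8 = ch.encode("utf-8")  # the bytes actually stored in memory or on disk
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch):<32} "
              f"{len(utf8)} byte(s): {utf8.hex(' ')}")

Running it shows U+0041 encoded as the single byte 41, and U+20AC (the euro sign) as the three bytes e2 82 ac.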
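Second sketch: the mojibake from the bullets above, reproduced by reading bytes back with the wrong encoding. The string "café" is only an example.

    text = "café"

    # UTF-8 bytes read back as Latin-1: every byte maps to *some* character,
    # so you silently get the classic garbled text instead of an error.
    print(text.encode("utf-8").decode("latin-1"))                      # cafÃ©

    # Latin-1 bytes read back as UTF-8: 0xE9 is not a valid sequence, so the
    # decoder substitutes the replacement character U+FFFD.
    print(text.encode("latin-1").decode("utf-8", errors="replace"))    # caf�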
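Last sketch: the "declare your charset" advice in practice, using Python's built-in http.server. The port and the body text here are arbitrary choices for the example.

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = "Γειά σου, κόσμε".encode("utf-8")
            self.send_response(200)
            # Tell the browser exactly how the bytes are encoded, so it never has to guess.
            self.send_header("Content-Type", "text/plain; charset=UTF-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), Handler).serve_forever()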