There was a time when you could determine the size of a file by counting the number of characters it had. One character equates to one byte. Simple. In fact, it was how I found the office perpetrator who printed out a nasty letter for everyone to see. I went through all the print logs and counted the bytes.
In many cases, this is still true. However, for languages, such as Chinese, with thousands of characters, 8 bits (2^8 = 256) is not enough. For this reason a multitude of encoding standards (ISO-8859, Mac OS Roman, Big5, MS-Windows char sets, etc) have been implemented but it has been a headache to make consistent across applications and delivery systems. In some cases, in order to have multiple encodings or character sets in one document would require yet another encoding standard or would just be impossible. This not only applies to text documents, but web pages and databases as well.
We needed a standard that encompass it all. That standard is called Unicode.
What is Unicode?
Unicode is just a giant mapping table of numbers (code points) to characters. That’s about it. The kicker is that it includes every character imaginable on this planet. Basically it’s the superset of all character sets in existence today. It even includes ancient scripts like Egyptian Hieroglyphs, Cuneiform, and Gothic. The characters make up the code space.
Unicode encodings (e.g., UTC-8) specify how these numbers (with their own code points) are represented as bits.
Consists of 17 planes of 65,536 (=2^16) code points each. That’s 1,114,112 code points. That’s enough code points to map all past, present, and future characters created by mankind. The first plane, Basic Multilingual Plane (BMP) contains most commonly used characters.
What’s the difference between a character set and an encoding?
Character sets are technically just list of distinct characters and symbols. They could be used by multiple languages (e.g., Latin-1 is used for the Americas and Western Europe).
Encoding is the way these characters are stored in memory. An encoding maps these characters to a binary representation.
Character sets that have encodings are called coded character sets. Unsurprisingly, this is a bit confusing because many systems use them interchangeably. For example, MySQL calls a characters and their encodings simply as a character set. What they really mean is a coded character set (or code pages).
Every encoding must have a character set associated with it but a character set could have multiple encodings. The most relevant example of this is the Unicode character set with multiple encodings (UTF-8, UTF-16 BE, UTF-16 LE, UTF-32, etc). The same character in one encoding could be represented by a larger/smaller number of bytes in another encoding.
This W3C article does a fine job explaining this.
What are code pages?
It’s mostly a Microsoft Windows-specific encoding that is based on standard encodings with a few modifications. It could also be generically a coded character set.
UTF-8 vs UTF-16 vs UTF-32
- Variable-length 8 bit code units
- Backward compatible with ASCII without having to do deal with endianness or byte order marks (BOM). The first 128 characters correspond one-to-one with ASCII.
- Some commonly used characters could be various lengths which could cause indexing and calculating a code point slow.
- Variable-length 16 bit code units
- Great if ASCII doesn’t dominant the document. It’ll use 2 bytes total whereas UTF-8 will use 3 or more bytes. e.g., East Asian languages required 2 bytes in UTF-16 whereas in UTF-8 it would be at least 3.
- If using primarily US-ASCII strings, there will be lots of null bytes.
- 32 bit code units
- You don’t need to decode the code point as it’s given to you in it’s purest 32-bit format.
How does character sets and encoding relate to fonts?
A font defines the “glyphs” for usually a single character set or a subset of a character set. If there’s a character undefined in the font, you’ll typically get a replacement character like a square box or question mark.
Basically, fonts are glyphs that are mapped to code points in a coded character set.
At this time, most systems are using UTF-8. It’s efficient as far as storage (as long as it’s mostly ASCII characters). It has the possibility of mapping any character imaginable so there’s really no reason not to use it.
When you type on your keyboard, you’re using a certain encoding scheme. When you save that file and display the text again using the same encoding, you’ll get consistent results. The biggest problem we run into is seeing random looking characters in our files. The only explanation for this is that the encoding used to view the file is incorrect.
It’ll be important to note: conversion from one encoding to another is not for the faint of heart. You have to know what you’re doing or you’ll lose your original bits forever. Sometimes it’s not even possible to perform the conversion.
From this point forward, a byte no longer equates to character. Be wary of the encoding scheme used, especially if you start to see a snowman and cellphones in your CSV file.
Bottomline: Use UTF-8.