Tuesday, July 5, 2011

Character Encoding Fun!

Let's talk about character encoding. This seems to be a common blind spot for a lot of developers.

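Joel Spolsky found this to be true, so he wrote a great article about character encoding and Unicode: "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". I really recommend that you give it a read. It's a little old (2003), but still completely relevant.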

If you are feeling too lazy to read his summary (I blame summer), you can read my even shorter summary.

1) There is no such thing as "plain text strings". You should not assume any given string is in ASCII. You, in fact, have no idea what the string means until you know how it's encoded.

2) Unicode is a character set that hopes to include characters from almost all languages. Unicode is not an encoding though. Older character sets, like ASCII, mapped characters ('A') to numbers (65), which got encoded as the binary representation of that number. Unicode maps characters to something called code points. These code points look something like U+0041 (that one is 'A'; the digits are hexadecimal, so 41 is decimal 65). These code points are then encoded using some encoding system. There are many ways to do this encoding, but perhaps the most common is UTF-8.

3) Unicode is not always encoded as 2 bytes. UTF-16 is a specific encoding that uses 2 bytes for the most common characters, but code points outside the Basic Multilingual Plane take 4 bytes (a surrogate pair). Other encodings differ: UTF-8 uses anywhere from 1 to 4 bytes per code point, and UTF-32 always uses 4 bytes.

4) UTF-8 is backwards compatible with ASCII. The first 128 code points (the 7-bit ASCII range) are encoded as exactly the same single bytes that ASCII uses. This means that any valid ASCII text is also valid UTF-8.

5) Code points can be encoded in many ways. You can even encode Unicode code points using old-school ASCII encoding. What happens to code points that ASCII encoding doesn't define? They typically show up as ?. If you've ever seen international data that appears as ????????, it means that the encoding being used doesn't support those code points.
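
If you'd rather poke at this than take my word for it, here is a rough Python 3 sketch of the five points above. It only uses the built-in str.encode, bytes.decode and ord; the sample characters and variable names are just picks for illustration (they are not from Joel's article), and the non-ASCII characters are written as escapes so the snippet itself stays plain ASCII.

```python
# Point 1: bytes mean nothing until you know the encoding.
raw = "\u00e9".encode("utf-8")        # e-acute becomes two bytes: 0xC3 0xA9
as_utf8 = raw.decode("utf-8")         # decoded correctly: one character
as_latin1 = raw.decode("latin-1")     # same bytes, wrong guess: two characters

# Point 2: a character maps to a code point, independent of any encoding.
code_point_a = f"U+{ord('A'):04X}"    # 'A' is number 65, written U+0041

# Point 3: the number of bytes depends on the encoding and the code point.
euro = "\u20ac"                       # EURO SIGN, inside the BMP
emoji = "\U0001F600"                  # GRINNING FACE, outside the BMP
byte_lengths = {
    name: (len(euro.encode(name)), len(emoji.encode(name)))
    for name in ("utf-8", "utf-16-le", "utf-32-le")  # "-le" avoids a BOM
}

# Point 4: pure ASCII text produces identical bytes in ASCII and UTF-8.
ascii_matches_utf8 = "Hello".encode("ascii") == "Hello".encode("utf-8")

# Point 5: code points the target encoding lacks get replaced with '?'.
# errors="replace" asks for that behaviour; the default is to raise
# UnicodeEncodeError instead.
lossy = "\u65e5\u672c\u8a9e".encode("ascii", errors="replace")

print(raw, as_utf8, as_latin1)
print(code_point_a)
print(byte_lengths)
print(ascii_matches_utf8)
print(lossy)
```

The "-le" variants are used for UTF-16 and UTF-32 because Python's plain "utf-16" and "utf-32" codecs prepend a byte order mark, which would inflate the byte counts.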

I hope this fills in some of these character set and encoding knowledge holes. :) Now I should probably do one of those assignments I have due this week (>_<). School terms in the summer suck.

