Thursday, January 3, 2013

UTF-8: An Amazingly Simple Way of Internationalization

When you venture outside the realm of the English language, learning a new language is challenge enough, but imagine that challenge as a programmer who is trying to internationalize software. This is what I faced about three years ago when I started to develop learncurve, a program designed to make flash card review effective and efficient. As I started development, I already knew that the char type C uses for strings holds only 8 bits on most platforms. English doesn't need any more than 7 bits, but other languages have writing systems that won't all fit in just 8 bits. So I began reading up on how to support these different languages. The answer was Unicode, but which Unicode?

For the first two and a half years of development I thought I needed to use the wchar_t type, which, depending on your platform, provides a character that holds 16 or 32 bits of information. I thought a char wouldn't cut it because I really hadn't done my homework on Unicode. You see, I thought I needed UTF-16 because I thought UTF-8 was too small, but on doing more reading I found very little information about programming with UTF-16, or UTF-32 for that matter. Most of what I found concerned UTF-8. Eventually I read more about UTF-8 and found out that it covers the whole Unicode character set by using a sequence of one to four bytes to hold each character, rather than a single char as in plain C strings.
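To make the multi-byte idea concrete, here is a minimal sketch of how a single Unicode code point can be turned into its UTF-8 bytes. The function name utf8_encode is my own for illustration and is not part of the C standard library or of learncurve.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 encoding of code point cp into out
 * and returns the number of bytes written (1 to 4), or 0 if cp is not a
 * valid Unicode scalar value. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF) /* surrogates are not characters */
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* four bytes, beyond 16 bits */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* outside the Unicode range */
}
```

An English letter still takes one byte, a Greek letter takes two, a common Chinese character takes three, and a Cuneiform sign takes four, yet every one of those bytes fits in an ordinary char.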

In fact, UTF-16 works in a similar way, using one or two 16-bit units per character, but since its unit is twice as wide it has to deal with the extra zeros that the extra 8 bits bring along for characters like plain English letters. This wouldn't be an issue except for how a computer stores those two bytes, and depending on the processor you use it can make a big difference. For those familiar with the concept, we are talking about big-endianness and little-endianness. Put simply, it determines which byte of a multi-byte value comes first in memory. After reading about the complexities of UTF-16, I decided to do more digging on using UTF-8.
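As a rough illustration of the byte-order problem, the sketch below encodes a code point as UTF-16 and then writes one 16-bit unit out as bytes in either order. Both function names are hypothetical and exist only for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes cp as one or two UTF-16 code units.
 * Returns the number of units written, or 0 if cp is not valid. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                         /* reserved for surrogates */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;            /* fits in a single unit */
        return 1;
    }
    if (cp <= 0x10FFFF) {
        cp -= 0x10000;                    /* split 20 bits across a pair */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        return 2;
    }
    return 0;
}

/* Hypothetical helper: stores one code unit as two bytes in the
 * requested byte order. */
void utf16_unit_to_bytes(uint16_t unit, int big_endian, unsigned char out[2])
{
    unsigned char hi = (unsigned char)(unit >> 8);
    unsigned char lo = (unsigned char)(unit & 0xFF);
    out[0] = big_endian ? hi : lo;
    out[1] = big_endian ? lo : hi;
}
```

The same Greek capital omega (U+03A9) comes out as the bytes 03 A9 in big-endian order and A9 03 in little-endian order, so a reader has to know which one the writer used. UTF-8 has no such ambiguity because its unit is a single byte.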

Since UTF-8 could in theory handle all the characters in the Unicode character set using only char, I had to test it and see if it worked. So I did a simple test, having it read in characters from Greek, Hebrew, Arabic, and Chinese, and, to really stress test it all the way past 16 bits of information, I used Cuneiform. All of them displayed in the terminal using just the normal char functions, which are by no means supposed to handle wide characters.

Now this approach does have its drawbacks. For starters, we cannot index a UTF-8 string the same way we do a normal C string, because it is multi-byte. To extract a character out of a multi-byte string by array position, you have to do some extra work. For my program it does not matter, as I am not selecting character sequences that way. The other drawback is that sometimes the terminal will not display a character high in the range correctly. This has more to do with the font that the terminal is using, but if you were to make a command line application using these high-end characters, such as the CJK block or Cuneiform, you would need to instruct the user to use the right font or find a way to set that for them. Another disadvantage is that until the recent C standard (C11), C had no support for Unicode string literals. The new standard adds them (for example, the u8 prefix for UTF-8 literals), but it will be some time before that feature has been supported long enough to use it in production C code.
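The extra work for indexing mentioned above can be sketched roughly like this. It relies on the fact that every continuation byte in UTF-8 has the bit pattern 10xxxxxx, and it assumes the input is already valid UTF-8. The names utf8_length and utf8_at are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: counts characters (code points) rather than bytes
 * by skipping continuation bytes, which always look like 10xxxxxx. */
size_t utf8_length(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Hypothetical helper: returns a pointer to the first byte of the
 * character at position index, or NULL if the string is too short. */
const char *utf8_at(const char *s, size_t index)
{
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            if (index == 0)
                return s;
            index--;
        }
    }
    return NULL;
}
```

Note that finding a character this way means walking the string from the start, where a plain array index is a single step. That cost is the trade-off for keeping everything in char.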

Using UTF-8 in C is as simple as including the locale.h header and using the setlocale function. Here is an example:

#include <locale.h>
#include <stdio.h>

int main(void)
{
    char buf[300];
    FILE *fp;

    /* LC_ALL covers every category (LC_CTYPE, LC_COLLATE, ...), so one call is enough. */
    if (setlocale(LC_ALL, "en_US.UTF-8") == NULL) {
        fprintf(stderr, "warning: locale en_US.UTF-8 is not available\n");
    }

    fp = fopen("test.txt", "r");
    if (fp == NULL) {
        perror("test.txt");
        return 1;
    }

    /* fgets copies raw bytes, so multi-byte UTF-8 sequences pass through untouched. */
    while (fgets(buf, sizeof buf, fp) != NULL) {
        printf("%s", buf);
    }

    fclose(fp);
    return 0;
}

Using a UTF-8 enabled text editor, save some text in the language you want to test as test.txt, and then run this code. You will of course have to do more study for yourself, but hopefully this will give you the start that you need.