Wednesday, March 27, 2013

Scratching Your Own Itch

It has been said that most program development in the open source community is done by individuals scratching their own itch: when someone has a need, they write their own solution to it. Linux itself is an example. Linus Torvalds wanted an operating system that took advantage of the specific features of the 386 processor, so he wrote one, and he released his code so other students could play with it without having to reinvent the wheel. A good number of us use that kernel today. Some things you want may not yet exist on Linux, but if you are willing to put in the time and effort to scratch your own itch, this post is for you.

Software development on Linux can be done in almost any language; the Linux/Unix community probably supports more languages than Windows or Mac OS X does. Before you say it, I do know that Mac OS X is Unix, but unless you use MacPorts there isn't much support for this kind of development. Two very popular languages are C and Python, and they probably have the most libraries available of any language. C++ can use C libraries, but libraries that take advantage of C++ features are not as common.

Since you will probably want more than just command-line utilities, you should know which toolkits you can use for building graphical applications. GTK+ is the library behind GNOME, MATE, Cinnamon, Xfce and LXDE. There are currently two versions, GTK+ 2 and GTK+ 3; version 2 is still more common on distributions that are not bleeding edge, such as Debian, Red Hat and its clones, Slackware and others, so depending on your target environment you may want to take that into consideration. GTK+ is written in C and works with it by default, but Python also has bindings to the library in a package called PyGTK. The other major toolkit is Qt, which is what KDE is based on. Qt is written in C++, but it can be used from Python as well. Both toolkits will also work on Windows if you install the libraries there.
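
To give you a feel for the C side, here is a minimal sketch of a GTK+ 2 program that just opens an empty window; the window title and file name are placeholders of my own choosing. It should build with something like gcc hello.c -o hello $(pkg-config --cflags --libs gtk+-2.0).

#include <gtk/gtk.h>

int main(int argc, char *argv[])
{
    GtkWidget *window;

    /* Initialize the toolkit and let it parse standard GTK+ options. */
    gtk_init(&argc, &argv);

    window = gtk_window_new(GTK_WINDOW_TOPLEVEL);
    gtk_window_set_title(GTK_WINDOW(window), "Scratch");

    /* Quit the main loop when the window is closed. */
    g_signal_connect(window, "destroy", G_CALLBACK(gtk_main_quit), NULL);
    gtk_widget_show(window);

    /* Hand control to GTK+'s event loop. */
    gtk_main();
    return 0;
}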

Another thing worth mentioning is that Python has two versions being maintained and worked on right now: 2.x and 3.x. 3.x code is not backwards compatible with 2.x, and not all 2.x code runs on 3.x; for example, print is a statement in 2.x but a function in 3.x. As of now, most libraries support the 2.x series, and this is why sites like Udacity, as well as books and tutorials, teach version 2 over 3.

There are a good number of development environments on the Linux platform. I have tried IDEs and never really liked them, but some of the ones I have tried are Eclipse, Anjuta and NetBeans. I use Emacs as my text editor, though Vim, gedit, Kate and others work well too. For debugging I tend to use gdb in TUI mode or Insight. Python of course has its own interpreter, and there are a few other implementations out there, though I have never tried them. The two popular C compilers right now are gcc and clang, and both suites include C++ and Objective-C compilers as well. I like clang better for its static analyzer, which tells you more about what is wrong with your code or what could be problematic. Clang is also C99 compliant, but I don't believe it has an option for C89-only code, which gcc does. As for your build process, you will want to read up on makefiles; make is the standard way of building on the Unix platform. A minimal makefile is sketched below. There are other tools that make the process easier, such as CMake and Autotools, but I have never used either.
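
As a starting point, here is a minimal sketch of a makefile for a hypothetical single-file program; the names hello and hello.c are just placeholders. One classic gotcha: the command lines under each target must be indented with a tab, not spaces.

# Compiler and flags used by the rules below.
CC = gcc
CFLAGS = -Wall -g

# Rebuild the hello binary whenever hello.c changes.
hello: hello.c
	$(CC) $(CFLAGS) -o hello hello.c

# Remove the built binary.
clean:
	rm -f hello

Running make builds the program, and make clean removes the binary.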

Hopefully that is enough to get you started with your itch scratching. Between Google, some good books, YouTube, and some trial and error you should be good to go. Another helpful resource for finding headers or functions you may need is the apropos command; for example, apropos locale lists the manual pages related to locales, though this depends on your already having the relevant library and its documentation installed. Devhelp is a good reference tool for GNOME-related libraries, though it won't teach you how to use them. Happy coding!

Thursday, January 3, 2013

UTF-8: An Amazingly Simple Way of Internationalization

When you venture outside of the realm of the English language, it can be enough of a challenge just learning a new language, but imagine that challenge as a programmer trying to internationalize software. This was something I faced about three years ago when I started to develop learncurve, a program designed to make flash card review effective and efficient. As I started development, I already knew that the char type C uses for strings can only hold 8-bit values. English doesn't need more than 7 bits, but other languages have writing systems that won't all fit in just 8 bits. So I began reading up on how to support these different languages. The answer was Unicode, but which Unicode?

For the first two and a half years of development I thought I needed to use the wchar_t type, which, depending on your platform, provides a character that can hold 16 or 32 bits of information. I thought a char wouldn't cut it because I really hadn't done my homework on Unicode. You see, I thought I needed UTF-16 because I assumed UTF-8 was too small, but on doing more reading I found very little information about programming with UTF-16, or UTF-32 for that matter; most of what I found concerned UTF-8. Eventually I did more reading on UTF-8 and found out that it covers the whole Unicode character set by using a multi-byte sequence to hold characters, rather than a single char as in C.
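
To make the multi-byte idea concrete, here is a small sketch that prints the bytes behind a UTF-8 string; the sample characters are my own choice, not anything from learncurve.

#include <stdio.h>

int main(void)
{
    /* Greek alpha (U+03B1) is two bytes in UTF-8, and the Chinese
     * character for "middle" (U+4E2D) is three, yet both live in a
     * plain char array. */
    const char *s = "\xCE\xB1\xE4\xB8\xAD";

    /* Prints: CE B1 E4 B8 AD */
    for (const char *p = s; *p != '\0'; p++)
        printf("%02X ", (unsigned char)*p);
    printf("\n");

    return 0;
}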

In fact UTF-16 works this way as well, but since it is a larger data type it has to deal with the extra zeros the additional 8 bits bring along. This wouldn't be an issue except for how a computer deals with those zeros, and depending on the processor you use it can make a big difference. For those familiar with the concept, we are talking about big-endianness and little-endianness, which, to put it simply, determine the order in which a computer stores the bytes of a multi-byte value. After reading about the complexities of UTF-16, I decided to do more digging on using UTF-8.
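
Here is a quick sketch, using the fixed-width types from stdint.h, of how the same 16-bit value is laid out differently in memory depending on the machine; the value 0x03B1 is just the UTF-16 code unit for Greek alpha.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t alpha = 0x03B1;  /* UTF-16 code unit for Greek alpha */
    unsigned char *bytes = (unsigned char *)&alpha;

    /* A little-endian machine (like the x86 family) stores the low
     * byte first and prints "B1 03"; a big-endian machine stores the
     * high byte first and prints "03 B1".  This is why UTF-16 streams
     * need a byte order mark. */
    printf("%02X %02X\n", bytes[0], bytes[1]);

    return 0;
}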

Since UTF-8 could in theory handle all the characters in the Unicode character set using only char, I had to test and see if it worked. So I did a simple test, having it read in characters from Greek, Hebrew, Arabic and Chinese, and, to really stress test it all the way past 16 bits of information, I used Cuneiform. All of them displayed in the terminal using normal char functions that are by no means supposed to handle wide characters. Now this approach does have its drawbacks. For starters, we cannot index a UTF-8 string the same way we do a normal C string, because it is multi-byte; to extract a character out of a multi-byte string by array position you would have to do some extra work, as sketched below. For my program it does not matter, as I am not selecting character sequences that way. The other drawback is that sometimes the terminal will not display a character high in the range correctly. This has more to do with the font the terminal is using, but if you were to make a command-line application using these high-end characters, such as the CJK block or Cuneiform, you would need to instruct the user to use the right font or find a way to set it for them. Another disadvantage is that until the recent C standard (C11), support for Unicode string literals was not in C. The new standard has it, but it will be some time before that feature is widely supported enough to use in production C code.
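
As an illustration of that extra work, here is a minimal sketch of counting code points rather than bytes; utf8_strlen is a hypothetical helper of my own naming, and it relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

#include <stdio.h>
#include <string.h>

/* Count code points in a valid UTF-8 string by skipping
 * continuation bytes (those matching 10xxxxxx). */
static size_t utf8_strlen(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;

    return count;
}

int main(void)
{
    const char *greek = "\xCE\xB1\xCE\xB2\xCE\xB3";  /* alpha, beta, gamma */

    /* strlen() sees six bytes; utf8_strlen() sees three characters. */
    printf("bytes: %zu, code points: %zu\n",
           strlen(greek), utf8_strlen(greek));

    return 0;
}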

Using UTF-8 in C is as simple as including the locale.h header and using the setlocale function. Here is an example:

#include <locale.h>
#include <stdio.h>

int main(void)
{
    char buf[300];
    FILE *fp;

    /* LC_ALL covers every category (LC_CTYPE, LC_MESSAGES, LC_NUMERIC,
     * LC_COLLATE and so on), so one call is enough.  Passing "" instead
     * would select the user's own locale from the environment. */
    setlocale(LC_ALL, "en_US.UTF-8");

    fp = fopen("test.txt", "r");
    if (fp == NULL) {
        perror("test.txt");
        return 1;
    }

    while (fgets(buf, sizeof buf, fp) != NULL) {
        printf("%s", buf);
    }

    fclose(fp);

    return 0;
}

Using a UTF-8 enabled text editor, save some text in the language you want to test to test.txt, then compile and run this code. You will of course have to do more studying for yourself, but hopefully this gives you the start that you need.