Thursday 2 June 2011

C and UTF-8 data in Linux

Over the last few days I found myself struggling to write a PHP extension that would let PHP convert look-alike characters between Greek and Latin, so that typed codes would, for instance, always use the same version of 'A' no matter which language the user's keyboard is switched to. You see, whether you type a Latin 'A' (U+0041) or a Greek 'Α' (U+0391), the two letters look the same despite the difference in their internal representation. The same applies to other common letters like 'E', 'P' and 'Y'.
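To make the difference visible, here is a minimal sketch that dumps the UTF-8 bytes of a string as hex. The helper name hex_dump_utf8 is my own invention for illustration. The Greek letter is written with explicit byte escapes so the result does not depend on the encoding of the source file.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of s into out as space-separated
   uppercase hex. out must hold at least 3 * strlen(s) + 1 bytes. */
static void hex_dump_utf8(const char *s, char *out)
{
    size_t i, n = strlen(s);

    out[0] = '\0';
    for (i = 0; i < n; i++) {
        /* Each byte takes two hex digits plus a separator, except the last. */
        sprintf(out + 3 * i, i + 1 < n ? "%02X " : "%02X",
                (unsigned char)s[i]);
    }
}
```

Called on "A" it produces 41, while on "\xCE\x91" (the Greek capital alpha, U+0391) it produces CE 91: one byte against two for letters that look identical on screen.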

I reckoned that this conversion and translation of characters based on their position on some given string would be an excellent chance to remember string pointers from my student C days, so I set out to create a library with functions called latinToGreek() and greekToLatin() that would do the conversion and return the new string.

Like most people in a similar position, I tried googling for C and Unicode howtos and ran into some very good articles, but nothing in the form of a recipe that I could follow. I read many of them and managed to get the work done. The basic idea is that, since UTF-8 uses a variable number of bytes per character, we need to convert UTF-8 strings into wchar_t strings, process them as if they were ordinary null-terminated arrays of fixed-width characters, and finally convert them back to UTF-8 multibyte char strings so that they display correctly.
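As a small illustration of the bytes-versus-characters gap, the sketch below counts the characters of a UTF-8 string by asking mbstowcs how many wide characters the conversion would need. The function name utf8_char_count is hypothetical, and so is the choice of locale names to try. Passing NULL as the destination to get only the count is POSIX behaviour, which is what I assume on Linux.

```c
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: number of characters in the UTF-8 string s.
   Returns (size_t)-1 if no UTF-8 locale could be selected or if s is not
   valid UTF-8. Note the side effect: it changes the program's LC_CTYPE. */
static size_t utf8_char_count(const char *s)
{
    /* Assumption: at least one of these locale names is installed. */
    if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL &&
        setlocale(LC_CTYPE, "C.UTF-8") == NULL)
        return (size_t)-1;

    /* With a NULL destination, mbstowcs only counts the wide characters. */
    return mbstowcs(NULL, s, 0);
}
```

For the two Greek letters "\xCE\x91\xCE\x92" strlen() reports 4 bytes, while this function reports 2 characters, provided a UTF-8 locale is available.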

Here is the checklist of things to do before actually playing with UTF-8 encoded Unicode.

  1. Use setlocale to set the current program locale to something that supports UTF-8. For example 'en_US.UTF-8' is just fine.
    Bear in mind that the default locale for C programs is the "C" locale, and unless you set the locale correctly nothing will work as expected.
  2. Read your input using normal char[] arrays. Just make sure that they are big enough. Remember that UTF-8 uses between one and four bytes per character. The basic Greek letters take two bytes each, so for Greek and Latin text a reasonable guess is an array at least twice the size of your maximum anticipated input, plus one byte for the terminating null.
    Do not forget that good old strlen() will still give you the size of your input in bytes, but not in characters.
  3. Convert multibyte input to wchar_t[] input in order to perform any processing. Remember that wchar_t constants need to be prefixed by an L, so '\0' is now L'\0'.
    Functions mbstowcs and wcstombs can be used to perform the conversion to and from.
  4. Perform your processing as you ordinarily would, using wchar_t data and wchar_t-specific functions. For example, toupper() for wchar_t is now towupper().
  5. Convert your data back to multibyte format using wcstombs and return them to the user.

... and that's about it.

2 comments :

adamo said...

I wonder whether you could achieve anything like this using libutf.

Athanassios Bakalidis said...

Possibly yes, however it took 56 lines of C code (comments and blank lines included) to provide this functionality.

I believe that you cannot get away with less, not to mention the speed benefits that come from plain C code.