The Hacks of Life: An Unfair Unicode Rant

I would not be the first person to be grumpy that you can't use UTF8 on Windows, and I must admit that my desire for that option comes from pure laziness. Perhaps it comes down to Microsoft not wanting to change their huge code base and me not wanting to change mine, but I must admit that X-Plane's code is still slightly smaller than, um, all of Windows. :-)

It turns out that the problem of dealing with Unicode isn't as bad as it seems - for reasons of pure luck (probably partly coming from X-Plane not being very file-system or text-centric) most of our text-based contact with other systems happens in only three or four translation units, so it's pretty easy to find and trap the bottlenecks.

The ugliest problem turns out to be the C runtime. For historical reasons X-Plane on Windows is built using Metrowerks CodeWarrior 8. The bad news is that the C runtime API uses single byte file paths, and does not support UTF8. The good news is that they give you the source.

So it looks to me like what we're going to have to do is put converter code at the bottom of the C library to convert 8-bit UTF8 paths (passed in to X-Plane) into UTF16 paths that we can then pass to the "W" variants of the file routines. Fortunately, since this happens "just in time", we only have to look for Win32 file system calls inside the C runtime source, which are already partitioned into a "Win32 specific" section.
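To make the "just in time" conversion concrete, here is a minimal, portable sketch of a UTF8 to UTF16 converter in C++. It is not the code that went into the C runtime; the names `utf8_to_utf16` and `append_utf16` are made up for illustration. On Windows itself the same job can be done by calling `MultiByteToWideChar` with `CP_UTF8`. The sketch replaces malformed bytes with U+FFFD and, to stay short, does not reject overlong encodings or UTF8-encoded surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append one Unicode code point as UTF-16, using a surrogate pair for
// anything outside the BMP. (Hypothetical helper, for illustration only.)
static void append_utf16(std::u16string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}

// Convert a UTF-8 byte string to UTF-16 code units. Malformed or truncated
// sequences become U+FFFD; overlong forms are NOT rejected in this sketch.
std::u16string utf8_to_utf16(const std::string& in)
{
    const char16_t kReplacement = 0xFFFD;
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(kReplacement); ++i; continue; }  // invalid lead byte

        bool ok = (i + extra < in.size());  // false if the sequence is truncated
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (b & 0x3F);
        }
        if (!ok || cp > 0x10FFFF) { out.push_back(kReplacement); ++i; continue; }

        append_utf16(out, cp);
        i += extra + 1;
    }
    return out;
}
```

Inside the runtime's Win32-specific section, the result of a conversion like this is what would be handed to a "W" routine such as `CreateFileW` in place of the original 8-bit path. Everything above that layer keeps passing plain `char` strings around.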

We're using UTF8 internally because it allows us to not change the size of any internal memory buffers or data structures - if we did we would have to trace every single instance to find which ones get written to a binary file format and convert there (and I can tell you off hand that we have a number of these cases).

Converting our strings to wide characters throughout seems like a loss - trading an annoying but survivable bug (incorrect strings) for a set of worse ones (data loss, file corruption, and crashes).
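A small sketch of why the buffer size matters, assuming a made-up record type that gets written straight to a binary file. `NarrowRecord` and `WideRecord` are hypothetical and not X-Plane structures. `char16_t` stands in for the 16-bit `wchar_t` you get on Windows.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical record as it might be written directly into a binary file.
struct NarrowRecord {
    char         name[32];  // UTF-8 fits in the same bytes as the old 8-bit text
    std::int32_t id;
};

// The same record after switching the string member to 16-bit characters.
struct WideRecord {
    char16_t     name[32];  // same element count, twice the bytes
    std::int32_t id;
};

// Typically 36 vs. 68 bytes: every file written with the old layout would
// now be misread unless each read and write site converts explicitly.
static_assert(sizeof(WideRecord) > sizeof(NarrowRecord),
              "widening the character type changes the on-disk layout");
static_assert(offsetof(WideRecord, id) > offsetof(NarrowRecord, id),
              "fields after the string move too");
```

With UTF8 the struct, and therefore the file format, stays byte-for-byte the same size. The cost is that a name may hold fewer than 32 visible characters.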

And that brings me to the grumpy rant part: a lot of the documentation and comments I found online were simply focused on the "Microsoft" process, e.g. use TCHAR for all your strings, #define UNICODE to 1, clean up the wreckage of having changed the size of a fundamental data type in your entire app, go home happy.

The truth is, the rant isn't about Microsoft, and their "churn the whole code base, redefine the world" approach to Unicode (which I am sure was a lot more compelling when the BMP was the entire character set...with wide characters, you get the joy of debugging your entire app combined with the joy of dealing with variable-length characters!) - rather this rant is about the state of documentation and how programmers use it. And I don't think the documentation writers are to blame. Besides their having to approach a very difficult problem without ever having enough time and budget, I think they're just giving the market what it wants.

Documentation today is built around a combination of lowest-level reference (e.g. dictionary-styled definitions of what a single function's parameters are) and recipes (that is, examples of how to do common tasks). Low-level reference is of course mandatory, but it's not enough. The gap is in conceptual documentation. The "why" of programming is under-documented, and from what I can tell, even when the why is documented (in a blog post, jammed into the remarks of a function reference, in an overview to a module), it is often ignored by programmers who are either too busy, too lazy, or too dumb to care.

The idea of programming like this is absurd - to use a library whose design you don't understand is like taking a speech, running it through Google language services, and then giving that speech in a language you don't speak in front of foreign dignitaries. Sure, they might look at you like they understand most of it, but fundamental things could be going wrong because you don't understand what you are saying*, and you wouldn't even know.

Something I have learned via the school of hard knocks is the rule of "no squirrelly behavior". That is to say, if your application does something that is surprising and innocuous, it always pays to take the time to understand what is going on (thus rendering the behavior unsurprising).

The alternative is to let the misunderstood behavior exist as "harmless", except that:

Since you don't really know what your app is doing, you won't understand what you say in all the code you write from that point on, and
more importantly, the harmless behavior is usually the tip of an iceberg that will sink your app in a way that is both much worse (crash, data corruption) and much harder to debug.

So in an attempt to end my rant and get back to the real work of drinking coffee and cursing:

Always take the time to understand why a library is the way it is before using it.
Always make sure you understand what your own code really implies.
If you have unexpected behavior, fix it early while it's easy to do so.

And my plea to documentation writers is: please tell us why, even if most programmers don't know why they need to know!

* This is part of Scott Meyers' 50th rule from "Effective C++": say what you mean and understand what you say. It could be the best single sentence of programming advice you will ever get.

Sunday, April 13, 2008