Monday, September 27, 2010

When Good Floating Point Goes Bad?

We have a handful of linear algebra/geometry routines in X-Plane that simplify writing geometric tests. In their heart, they almost all turn into dot products. So: when can we not trust them?
  • For real floating point data, dot products may not go to zero; the zero dot product is at the heart of "is this a left or right turn" and "which side of a line am I on". Rounding errors may force points to fall to one side of a line or another.
  • Even more weirdly, there's no guarantee that points will consistently fall on one side of a line or another; the rounding errors need to be treated as effectively random. A point that is moving across a line may give 'jittery' results as the point slowly crosses the line.
  • Order matters. Given the same theoretical line, defining it by swapping the input points (e.g. line AB instead of BA) may have unpredictable effects on side-of tests for very small distances.
  • Finally, there is no function more hellish than intersection. The more parallel the lines, the more completely insane the intersection results become. Serious paranoia is advised when dealing with intersections, because the routines can give you a positive result ("yes there was an intersection") with lunatic results ("and the intersection is on Mars"). I usually cope with this by using a dot product test to case out the near-collinear case, which is then handled in a different algorithm that doesn't require clean intersections.

Friday, September 03, 2010

OpenAL on Linux, Part 27

This bug has effected X-Plane 8 and 9; we were able to recut 9 to work around it, but X-Plane 8 is a closed product. Here's the short story:
  • A while ago intrepid developers replaced the implementation of libopenal on Linux with a new complete rewrite.
  • When they did so, they raised the major version number.
Huh??? This caused naive application developers like me to say things like "what the hell are you guys doing? The whole point of dynamic linking is that you can replace implementations without breaking my app. So why did you break my app?"

The change in major version breaks the link to X-Plane, and would be appropriate if the library wasn't compatible.

Yesterday someone finally posted a list of dropped ABI symbols in the new OpenAL implementation. They are all extension symbols except for alBufferAppendData. So I can't deny that symbols are dropped and that is an ABI breakage. The question is: should the soname be revised?


Most of the symbols missing are _LOKI. For those not familiar with OpenGL extensions (from which the OpenAL extension concept is stolen^H^H^H^H^H^Hborrowed) the idea is this: an app initializes the library, queries some kind of string to see what additional non-core features the library supports, and then resolves function pointers at run-time, using function pointers only once the extension string is present.

Therefore it's really important that the major version of the shared object not change when an extension is removed; the extension is not part of the ABI, applications should not (and cannot) depend on it being present at link-time, and an extension may not function without specialized hardware.


There is one mystery symbol: alBufferAppendData, which is present without a decoration. From what I can tell from the annotated OpenAL 1.0 specification, "append-data" was a proposed streaming scheme that was eventually moved to an extension when it was dropped from the core. It's not in the 1.0 spec and it's not in the 1.1 spec.

So this strikes me as a bug in the implementation of the original library: it exports a symbol that shouldn't be there. Does it make sense to raise the major version of the .so because the symbol has been dropped? I don't think so, but I can see how you could argue it both ways.

The argument for dropping it is this: if the major version is changed, then the old and new OpenAL implementations can live side by side, and all applications are happy. Since alBufferAppendData is not trivial functionality, this would be better than expecting the new implementation to support alBufferAppendData for historic reasons.

But this is not at all what is happening; instead distributions are purging (the old implementation) when they bring in the new one, and then asking applications to recompile themselves.

In other words, because some number of applications may be using a function that is not in any OpenAL specification but is in the old implementation, they have renamed the shared library, forcing everyone to recompile. In other words, they have replaced the convenience of having some games be broken with the convenience of having all games be broken.

(In X-Plane we work around this by simply dlopening either or, whichever one we can find. Since both implement the core spec symbols, this works fine.)

Wednesday, September 01, 2010

I Just Saw a Race Condition

I was debugging a threading bug and something truly bizarre happened to me: I was printing variables and when I went back to re-print a variable, it had changed on me! This was without actually ever running the program.

As far as I can tell, this is what happened:
  1. The variable I was printing was subject to writes in a race condition - some other thread was splatting it.
  2. After printing the variable, I printed some pieces of an STL container, which had to execute code in the attached process, which temporarily released all threads.
  3. Thus when I turn around, the program has been running.
In some ways, it was a really lucky find...after scratching my head and going "it's 2:30 AM and I've been drinking...that didn't just happen" I realized that the variable in question was being passed into the function by reference and thus might be secretly global (which was of course the real bug). Had I not seen the reference var get splatted under my nose it would have taken a lot more printing to find the problem.