Friday, September 03, 2010

OpenAL on Linux, Part 27

This bug has affected X-Plane 8 and 9; we were able to recut 9 to work around it, but X-Plane 8 is a closed product. Here's the short story:
  • A while ago intrepid developers replaced the implementation of libopenal on Linux with a complete rewrite.
  • When they did so, they raised the major version number.
Huh??? This caused naive application developers like me to say things like "what the hell are you guys doing? The whole point of dynamic linking is that you can replace implementations without breaking my app. So why did you break my app?"

The change in major version breaks the link to X-Plane; it would be appropriate only if the new library weren't compatible.

Yesterday someone finally posted a list of the ABI symbols dropped in the new OpenAL implementation. They are all extension symbols except for alBufferAppendData. So I can't deny it: symbols were dropped, and that is an ABI breakage. The question is: should the soname be revised?

Extensions

Most of the symbols missing are _LOKI. For those not familiar with OpenGL extensions (from which the OpenAL extension concept is stolen^H^H^H^H^H^Hborrowed) the idea is this: an app initializes the library, queries some kind of string to see what additional non-core features the library supports, and then resolves function pointers at run-time, calling those pointers only if the extension string is present.

Therefore it's really important that the major version of the shared object not change when an extension is removed; the extension is not part of the ABI, applications should not (and cannot) depend on it being present at link-time, and an extension may not function without specialized hardware.
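The query-then-resolve idiom looks roughly like this. (A sketch, not runnable stand-alone: it needs the OpenAL headers and a created context; the extension and function names here are made up for illustration - only alIsExtensionPresent and alGetProcAddress are real OpenAL entry points.)

```cpp
#include <AL/al.h>

// Hypothetical extension function type - not a real OpenAL symbol.
typedef void (*alExampleFuncPtr)(ALuint buffer);

static alExampleFuncPtr alExampleFunc = NULL;

void init_extensions(void)
{
    // Only resolve the pointer if the runtime advertises the extension.
    if (alIsExtensionPresent("AL_EXT_example"))
        alExampleFunc = (alExampleFuncPtr) alGetProcAddress("alExampleFunc");

    // If the string is absent, the app never touches the symbol - which is
    // why removing an extension is not a link-time ABI break.
}
```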

alBufferAppendData

There is one mystery symbol: alBufferAppendData, which is present without a decoration. From what I can tell from the annotated OpenAL 1.0 specification, "append-data" was a proposed streaming scheme that was eventually moved to an extension when it was dropped from the core. It's not in the 1.0 spec and it's not in the 1.1 spec.

So this strikes me as a bug in the implementation of the original library: it exports a symbol that shouldn't be there. Does it make sense to raise the major version of the .so because the symbol has been dropped? I don't think so, but I can see how you could argue it both ways.

The argument for dropping it is this: if the major version is changed, then the old and new OpenAL implementations can live side by side, and all applications are happy. Since alBufferAppendData is not trivial functionality, this would be better than expecting the new implementation to support alBufferAppendData for historic reasons.

But this is not at all what is happening; instead distributions are purging libopenal.so.0 (the old implementation) when they bring in the new one, and then asking applications to recompile themselves.

In other words: because some number of applications may be using a function that is not in any OpenAL specification but is in the old implementation, they renamed the shared library, forcing everyone to recompile. They replaced the inconvenience of having some games be broken with the convenience of having all games be broken.

(In X-Plane we work around this by simply dlopening either libopenal.so.0 or libopenal.so.1, whichever one we can find. Since both implement the core spec symbols, this works fine.)
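The fallback boils down to a few lines of dlopen. (This is a sketch of the idea, not X-Plane's actual code; the helper name is ours.)

```cpp
#include <dlfcn.h>
#include <string>
#include <vector>

// Return a handle to the first library in `names` that loads, or nullptr.
void* load_first(const std::vector<std::string>& names)
{
    for (const std::string& n : names) {
        if (void* handle = dlopen(n.c_str(), RTLD_NOW))
            return handle;
    }
    return nullptr;
}

// Usage: since both implementations export the core spec symbols,
// either soname will do:
//   void* openal = load_first({"libopenal.so.1", "libopenal.so.0"});
//   if (openal) { /* resolve core symbols with dlsym(openal, ...) */ }
```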

Wednesday, September 01, 2010

I Just Saw a Race Condition

I was debugging a threading bug and something truly bizarre happened to me: I was printing variables, and when I went back to re-print a variable, it had changed on me! This was without ever actually running the program.

As far as I can tell, this is what happened:
  1. The variable I was printing was subject to writes in a race condition - some other thread was splatting it.
  2. After printing the variable, I printed some pieces of an STL container, which had to execute code in the attached process, which temporarily released all threads.
  3. Thus when I turned around, the program had been running.
In some ways, it was a really lucky find...after scratching my head and going "it's 2:30 AM and I've been drinking...that didn't just happen" I realized that the variable in question was being passed into the function by reference and thus might be secretly global (which was of course the real bug). Had I not seen the reference var get splatted under my nose it would have taken a lot more printing to find the problem.
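A distilled version of the bug pattern (all names hypothetical - this is our illustration, not the X-Plane code): the parameter looks like a harmless local, but the reference binds it to a global that another thread is free to splat. Here the writer thread is joined before we look again, just so the demo is deterministic; in the real bug the write raced the debugger.

```cpp
#include <string>
#include <thread>

static int g_shared = 0;   // the "secretly global" state

// Looks innocent, but `v` aliases whatever the caller passed by reference.
std::string describe(const int& v)
{
    return "value=" + std::to_string(v);
}

std::string demo()
{
    std::string before = describe(g_shared);   // reference binds to the global
    std::thread other([] { g_shared = 42; });  // another thread splats it
    other.join();                              // joined, so the change is visible
    std::string after = describe(g_shared);    // same reference, new value
    return before + " -> " + after;
}
```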

Wednesday, August 11, 2010

When Is Your VBO Double Buffered?

A while ago I finally wrapped my head around this, and wrote a three-part post trying to explain why you never get double-buffered behavior from a VBO unless you orphan it. This is going to be an attempt to explain the issues more succinctly and describe how to stream data through a VBO.

The Problem

The problem that OpenGL developers (myself included) crash into is a stall in the OpenGL pipeline when trying to specify vertex data that changes every frame. You dutifully prepare new VBOs of meshes (for a particle system, for example), and when you dig in with your favorite adaptive sampling profiler, you find that you're blocking in glMapBuffer or glBufferSubData.

The problem is that the GPU has a "lock" on your VBO until it finishes drawing from it, preventing you from changing the VBO's contents. You can't put the new mesh in there until the GPU is done with the old one.

To understand why this happens, it helps to play "what if I had to write the GL driver myself" and look at what a total pain in the ass it would be to fix this at the driver level. In particular, even if you did the work, your GL driver might be slower in the general case because of the overhead to be clever in this case.

Your VBO Isn't Really Double-Buffered

Sometimes VBOs are copied from system memory to VRAM to be used. We might naively think that if this were the case, then we could gain access to the original system copy to update it while the GPU uses the VRAM copy.

In practice, this would be insanely hard to implement. First, this scheme would only work when VBOs are being shadowed in VRAM (not the case a lot of the time) and when the VBO has already been copied to VRAM by the time we need to respecify its contents.

If we haven't copied the VBO to VRAM, we'd have to stop and block application code while we DMA the VBO into VRAM (assuming the DMA engine isn't busy doing something else). If DMA operations on the GPU have to be serialized into the general command queue, that means the DMA operation isn't going to happen for a while.

If that hasn't already convinced you that treating VRAM vs. main memory like a double buffer makes no sense, consider also that if main memory is to be released, the VRAM copy is no longer a cached shadow, it is now the only copy! We now have to mark this block as "do not purge". So we might be putting more pressure on VRAM by relying on it as a double buffer.

I won't even try to work through the complexity that a pending glReadPixels into the VBO would add. It should be clear at this point that even if your VBO seems double buffered by VRAM, for the purpose of streaming data, it's not.

Your VBO Isn't Made Up of Regions

You might not be using all of your VBO; you might draw from one half and update the other. glBufferSubData won't figure that out. In order for it to do so, it would have to:
  • Know the range of the VBO used by all pending GPU operations. (This is in theory possible with glDrawRangeElements, but not the older glDraw calls.)
  • Track the time stamp of each individual range to see how long we have to block for.
The bookkeeping for our VBO has now changed from a single integer time stamp to a collection of ranges, each with its own time stamp, plus the set operations to maintain them. It's not surprising that the drivers don't do this. If you have a pending operation on any part of your VBO, glBufferSubData will block.

One Way To Get a Double Buffer

The one way to get a double buffer on older OpenGL implementations is to re-specify the data with glBufferData and a NULL pointer. Most drivers will recognize that in "throwing out" the contents of your buffer, you are separating the contents of the buffer for future ops from what is already in the queue for drawing. The driver can then allocate a second master block of memory and return that at the next glMapBuffer call. The driver will throw out your original block of memory later at an unspecified time once the GPU is done with it.
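The orphan-and-refill sequence looks like this. (A sketch of the call order only - it needs a current GL context and GL 1.5-level headers, so it is not runnable stand-alone; the function name is ours.)

```cpp
#include <GL/gl.h>
#include <cstring>

// Stream a fresh mesh into `vbo` every frame without stalling.
void stream_vertices(GLuint vbo, const void* verts, GLsizeiptr size)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    // Orphan: a NULL pointer tells the driver "I don't care about the old
    // contents." The driver can hand us a fresh block of memory while the
    // GPU keeps drawing from the old one.
    glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW);

    // This map now returns the new block, so it does not block on the GPU.
    void* dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (dst) {
        memcpy(dst, verts, size);
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}
```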

Alternatively, if you are on OS X or have a GL 3.0 extension, there are extensions that let you check out and operate on a buffer with locking suspended, allowing you to manage sub-regions in your buffer independently.

Friday, August 06, 2010

Restarting the OS X Window Server for Fun and Profit

Well, it's not very profitable. Hell, it's not even that fun. But let's just say, hypothetically, that you were working on a flight simulator with an OpenGL rendering engine. And let's just say, to make this interesting, that if you crank up all of the new rendering engine options, sometimes it causes the OpenGL stack to completely lose its meatballs, and the resulting carnage renders the entire computer unusable.

(If you are having trouble imagining this, close your eyes and visualize a desktop where nothing but the mouse moves, but as you drag what were your windows, small pieces of your scene graph flicker in and out of what used to be your open windows, as if you were just showing random parts of video memory. Okay, maybe it is a little bit fun.)

Here's what you need to get your life back:
  1. Have remote ssh enabled in the sharing control panel. ssh into your machine. Odds are, the remote shell is perfectly happy, even if the desktop looks like you hired Picasso as your art lead and he was extra high that day.
  2. kill -9 <pid> will bring back the desktop some of the time. That is, sometimes just killing off your app is enough to get your desktop back. Typically this is a win in the case where the driver is constantly resetting and you just can't use the UI because the reset cycle is slow.
  3. If that doesn't work, this will kill off the entire window manager (including, um, everything...the Finder, your app, Xcode, icanhazcheesburger): sudo killall -HUP WindowServer
It beats a full reboot (by some marginal amount).

A Healthy Fear of Threading

Continuing in the line of pithy quotes:
There are only two kinds of programmers: programmers with a healthy fear of threaded code and programmers who should fear threaded code.
Now I'm not saying "never thread". I'm just saying "you better be getting something good for that threading, because it's driving up your development costs."

In particular, the effective execution order of threaded code can change with every run, and there is no guarantee that you have seen every combination of execution order by running your program a finite number of times.

Thus any method of checking your code quality by running your program (perhaps many times) won't reliably detect bugs in threaded code. You may not find out until that user with one more core and a background program chewing up cycles hits an execution order that you haven't seen yet.

Instead for threaded code you have to prove logically that the execution order constraints applied (via locking, etc.) create a bounded set of execution combinations, and that each one is correct. This isn't quick or easy to do.

One way we cope with this development cost in X-Plane (where we need to use threads to fully utilize multiple cores) is to use threading design patterns with known execution limits. The most common one is a message queue, where ownership of data access flows with the message down a queue. This idiom not only guarantees serialized access to data without locks, but the implementation in C++ tends to make errors rare; if you have the message you have the pointer, and thus you have rights on the data. If you don't have the message, you have nothing to dereference.
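A minimal sketch of that message-queue idiom (the class and method names here are ours, not X-Plane's): ownership of the payload travels with the message as a unique_ptr, so whichever thread holds the pointer has exclusive rights to the data, and a thread without the message has nothing to dereference.

```cpp
#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>

template <typename T>
class MessageQueue {
public:
    // Sender gives up ownership when it enqueues the message.
    void send(std::unique_ptr<T> msg) {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push_back(std::move(msg));
        cond_.notify_one();
    }

    // Blocks until a message arrives; the caller becomes the sole owner.
    std::unique_ptr<T> receive() {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return !queue_.empty(); });
        std::unique_ptr<T> msg = std::move(queue_.front());
        queue_.pop_front();
        return msg;
    }

private:
    std::mutex mutex_;
    std::condition_variable cond_;
    std::deque<std::unique_ptr<T>> queue_;
};
```

The locks here protect only the queue itself; the payload never needs a lock because only one thread can hold it at a time.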

Sunday, July 18, 2010

How Does OpenGL Work?

If you are registered with Apple's developer sites, I strongly recommend the OpenGL and OpenGL ES video talks from WWDC 2010. The Apple engineers spell out in a fair amount of detail things that you had to infer previously, including:
  • How state information is accumulated and then resynchronized at draw call time.
  • How resources are synchronized and shared between host memory and the GPU.
The videos are in QuickTime format with subtitles, so you can play them back at 2x speed with captioning to get through the material faster.

Thursday, June 03, 2010

Interval Sets With the STL

I spent some time today working on an interval set. The basic idea of an interval set is to record a set of disjoint ranges that partition a number space into a finite "included" area and an infinite "excluded" area. Or to put it more simply, [3, 6) is an interval, and [3, 6) [8, 10) is an interval set.

Googling around for this I found a few ideas based on using a sorted map, with the interval beginning as the key and the interval end as the value. My approach is different, and is closer to the original implementation of Macintosh regions: a vector of beginning and ending pairs, e.g. { 3, 6, 8, 10 }.

I'm not sure whether this approach is superior to a map-based approach; I think I'd have to code each one all the way to completion. The vector does have a few advantages:
  • Compact storage, with minimal overhead.
  • Reading the sorted array can usually be done in linear or log-N time.
The basic rules for the interval set are:
  • Intervals are inclusive at the bottom and exclusive at the top. So the interval 3,6 includes the number 3 but excludes the number 6. (Thus given two intervals [3,6) and [6,9) the number 6 is included in exactly one of them.)
  • Intervals must have non-zero length (so [3,3) is illegal).
  • Intervals are finite - there is no notation to say "everything below 3" is included.
The heart of the algorithm is a "merge" operation. In a merge, the sequences of interval edges from two interval sets are traversed together (think of a merge sort) and each new smallest sub-interval is evaluated for inclusion by a boolean operation. This lets us perform a union, difference, intersection, or symmetric difference in linear time O(N+M), where N and M are the lengths of the vectors.

(In practice the time is slightly worse because the vector will need to periodically reallocate its memory during the creation of the new resulting vector. We could use a heuristic to pre-allocate some space at a loss of memory efficiency. If we used a set we could avoid reallocation costs, but we'd end up with O(N log N) time to build the set anyway, and we'd pay per-node overhead, which is almost certainly worse than any extra on a vector of 32-bit floats or integers.)

When we search for an interval (using lower_bound) we can tell whether we are "in" or "out" of the region by looking at whether the index of the returned region is even or odd - even regions are in the set and odd ones are outside of it.
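One way to write that even/odd test (a sketch; we phrase it with upper_bound so the half-open endpoints land correctly - x is inside exactly when an odd number of boundaries are less than or equal to it):

```cpp
#include <algorithm>
#include <vector>

// Membership test on a flat boundary vector like {3,6,8,10}, i.e.
// [3,6) [8,10). upper_bound's index is the count of boundaries <= x;
// an odd count means x sits inside an included interval.
bool contains(const std::vector<int>& set, int x)
{
    auto count = std::upper_bound(set.begin(), set.end(), x) - set.begin();
    return count % 2 == 1;
}
```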

The interval class is also heavily special cased for a number of optimizations:
  • Separate operators on pair allow for the processing of a single interval (rather than a set). When we know the single interval, we can take a number of short-cuts, and we can perform "in-place editing" using log-N searches into the original interval set.
  • Operations on sets can identify short cuts. For example, the intersection of two sets whose range is disjoint is always empty. (In other words, if the last value in A is less than the first value in B, intersecting A and B is an empty set.)
I haven't used the interval set class enough to profile it; real measurement will tell which of these optimizations is a win. One tricky aspect of the code is that vector is a leaky abstraction - it makes mid-vector insertion look cheap when really it is a linear operation (because all subsequent elements must be copied to their new locations).

As an example of why this might matter: consider symmetric difference (XOR) of an interval set and a single range. This operation can be computed simply by: deleting the range bounds from the set if they exist, otherwise inserting them. In other words, given the interval set [0,3) [6,9) [12,15) we can XOR this with the interval [6,8) by deleting 6 and inserting 8 - the new XOR is [0,3) [8,9) [12,15). This is a relatively fast operation: two log-N searches (for 6 and 8) and one delete and one insert.
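The delete-or-insert toggle can be sketched in a few lines (our illustration of the idea, not the real class; as the next paragraph notes, the naive version below eats two mid-vector edits):

```cpp
#include <algorithm>
#include <vector>

// XOR an interval set (flat boundary vector) with one interval [lo, hi):
// each bound is deleted if it is already an edge, inserted if it is not.
void xor_interval(std::vector<int>& set, int lo, int hi)
{
    for (int bound : {lo, hi}) {
        auto it = std::lower_bound(set.begin(), set.end(), bound);
        if (it != set.end() && *it == bound)
            set.erase(it);            // existing edge cancels out
        else
            set.insert(it, bound);    // bound becomes a new edge
    }
}
```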

Despite the simplicity of the algorithm, vector is going to require two mid-vector editing operations, so our average time complexity is O(N) - linear! (On average half the elements of the vector are after us, and we do two editing ops.)

For this reason, the disjoint XOR is special-cased. If we XOR [-10, -8) into the above region, we can observe that -8 < 0, therefore the regions don't intersect, and [-10, -8) simply needs to be prepended. This can be done with a single insert, and thus should run about twice as fast as a pair of individual inserts.