Saturday, May 07, 2016

Simultaneous Mipmap Level Generation

The novel trick deep in this post is the use of morton-number pixel addressing to generate mipmaps in parallel; I was surprised to not see anyone doing this so I figured I'd write it up. It is quite possible that no one does this because it's totally stupid, but the initial results looked promising under a narrow set of circumstances.

I was looking at CPU-based generation of mipmaps the other day. The traditional way to do this is with recursion: you calculate mip level 1 (quarter size of the original image) from the original image, then you calculate mip map level 2 from mip map level 1, again cutting down by a factor of four, and on and on.

This "recursive down-size" algorithm has a nice property: the total number of pixels read in the filtering operations is very low - about 33% more than the size of the original image. (We get this "efficiency" even though we are producing several new images because most of the mips are reading from much smaller images than the original.)

But recursively down-sizing has one hidden problem: the smaller mips are not actually computed from the original image! If you have a 256x256 image, that last pixel is computed from data that has been re-sampled seven time! Your lower mips are on the wrong end of a game of telephone.

Does this matter? That all depends on what your filtering algorithm is. If you are doing a simple box filter average, then it almost certainly doesn't matter - the limits of that filter are a lot worse than the recursive error.

But there are algorithms that make approximate mipmaps. A common trick these days in physically based rendering engines is to increase the roughness of lower mips based on the divergence of normals in the higher mips. The idea is that at lower mips our normals (pointing in all different directions) should be scattering light, but they've averaged out. We increase roughness to say "some scattering happened under this pixel that you can't see any more."

The problem is that this roughness increase is very much an approximation; the last thing we want to do is approximate an approximation over and over by recursively down-sizing. Whatever the problems of the algorithm, let's limit ourselves to absorbing the error once by always working from the highest level mipmap.

Two Ways To Sample

A naive way to compute these mipmaps is to compute each mip level from the original image, with each higher number (and smaller) mip level sampling more source pixels for each destination pixel. This solves our quality problem, but it's slow - for a 256 x 256 image, we're going to do eight full passes over the source image - each time we make the source image twice as big, we eat another pass. This is way worse than 33% more samples that we had in the recursive algorithm.

(Quick side note: note that overflow must be re-examined. Recursive mipmaps sample at most four source pixels for each destination, so for example we could use a short to accumulate unsigned bytes. The "use the source" image might sample 1M pixels to build the last mip, overflowing accumulation data types.)

I then got an idea: what if we could build all of the mipmaps at once in a single iteration? The algorithm would go something like this:

  • For each mip level we will compute, start a "bucket" to accumulate pixels.
  • For each pixel in the source image, read that pixel and accumulate it into each bucket.
  • For each of those buckets, if it is full (that is, if it has accumulated enough samples to be done), compute the destination pixel and reset the bucket.
For the first mip, we are going to empty our bucket and move on every four samples; for the last level mip (1x1) the bucket empties only once at the very end.  Each bucket is an accumulation of the exact right number of source pixels, and each source pixel is read only once.

In order to make this work, we have to use a very particular iteration order: we have to cover all four pixels under the first pixel of the first mip before we can go on to the fifth pixel - this applies at all mip levels. It turns out that this is done by -interleaving- the digits of our pixel address in binary.

(If you are familiar with tiling and GPU hardware this is totally unsurprising in any way.)

In other words, if we are doing the 87th pixel of a 256 x 256 image, that pixel number is 01010111 in binary.  We split the digits to get an X address of 1111 and a Y address of 001, that is, this is the 15th X pixel and 1st Y pixel. The first 64 pixels are in the lower left 8x8 of the image - they must be because we have to read those first 64 pixels to compute the 3rd mip-level's first pixel.  We then move to the right by 8 pixels - the next 16 pixels are used for the next 4x4, so we move four more to the right.  The next four pixels are the next 2x2, leaving 3 pixels left at 14,0 - in that 2x2 we are in the upper right, giving us an address of 15,1.

(Note: if your image is not square, all bits to the left of the number of bits needed to compute an NxN square, where N is the shorter dimension, must be applied only to the longer dimension.)

There is one problem with this algorithm: it is really, really, really cache unfriendly. The reason it is so cache unfriendly is that our image is stored in linear order, not tiled order, so all of those Y hops are jumping way forward and backward and all over the place in memory.  Our images are too big to just sit in cache (e.g. 4 MB) but in our old algorithms we just read through them in perfect order. That meant that that for each cache line fetched, we used the whole thing, and more importantly, our access order was so bleeding obvious that the hardware prefetcher could know exactly what was going on and have the next bytes ready.

To performance test this, I put my down-sampling algorithms into traits classes and wrote templated down-samplers for recursive, linear, and parallel mipmap generation. To test, I used two down-sampling algorithms:

  • "Raw" down-sampling simply averaged pixel values, as integers.
  • "sRGB" down-sampling converted the pixels to floating-point linear space, averaged in linear space, and then converted to sRGB.  The sRGB conversion is a correct conversion (e.g. linear at the bottom, 2.4 power on the top) in floating point, rather than a table lookup or approximation.
Here's the numbers, in ms for a 2048 x 2048 RGBA image:
            Raw     sRGB
Recursive   18.4    374
Sequential  72.1    2732
Parallel    124.9   392

As you can see, recursive is always fastest, and parallel is worse than sequential for purely data-movement workflows like "raw".  But when we go to the sRGB work-flow, parallel is much faster than sequential, and almost as good as recursive.

Why? I think the answer is that the conversion from 8-bit sRGB to floating point linear is expensive - there's branching, power functions and data unpacking in that conversion.  And the parallel algorithm absolutely minimizes this work - every source pixel is decoded exactly once - that's even 33% less than the recursive algorithm.  (Every algorithm has to write the same number of pixels in the end, so there's no change in absolute work then.)

My gut feeling is that this "win" isn't as good as it looks - the more data limited and the less CPU limited our algorithm, the more like 'raw' and the less like 'sRGB' we are.  In other words, there's no real replacement for recursion if you can afford it, and they wouldn't look close if I spent some time tuning my sRGB conversion.* And if there's any long term trend, it's code becoming more input data limited and less ALU bound.

You Know We Have GPUs, Right??

At this point you should have only one question: Ben, why in the hell are you not using the GPU to compute your mipmaps???

There are a few conditions particular to the X-Plane engine that make on-CPU mipmapping interesting in X-Plane.
  • Most of our mipmaps are precomputed in DDS files. Therefore most of the time, our interaction with the driver is to feed the entire mip pyramid to the driver from the CPU side. By computing mipmaps for PNG files on the CPU, we ensure that every one of our textures goes into the driver in the same way. 
  • Since we are OpenGL based, the driver needs to manage residency of textures; if the driver decides to prioritize PNG files because they are more expensive to swap off the GPU because the mipmaps aren't already resident in system memory too, that's not good for us.
  • Some day when we use a lower level API, we will need to manage residency, and it will be easier to have only code flow (system-pool for all resources, page them to the GPU when they are in the working set) for all textures.
  • Disk-texture upload is done entirely in background threads - the sim never waits, and pre-loading is N-way threaded during load, so the cost of CPU time is actually pretty cheap in a multicore world.
Anyway, if it turns out that parallel mip generation is "a thing" and my Google Fu is just weak, I'd be curious to hear about it.

* I did not try a table-based sRGB sRGB decode, which would be an interesting comparison - I'll have to try that on another day.  I did try RYG's SSE + table-based re-encode, and it does speed things up compared to a non-SSE implementation. I was able to modify it for alpha support but haven't had time to write a matched decoder.  (I didn't see any back story to the post, so I don't know if decode should always just be table-based.)

Friday, January 08, 2016

Work Stealing and Lock Free Chaos

Pablo Halpern explains work stealing at cppcon 2015:

A good introduction to more advanced parallel task scheduling strategies.

X-Plane's task scheduler for multi-core work is a central work-pool, similar to libdispatch and pretty much any basic worker-pool design that isn't trying to get too clever with continuations.  Our assumption is that the amount of background work is (hopefully) relatively large; typically the work is "cold" when it's scheduled and is going to a core other than the one running the main rendering lop.

And Fedor Pikus on lock-free programming.  There's some mind-bending stuff in here. For example, is your concurrent FIFO really a FIFO when used with multiple threads?  No one knows, because the only way to test it (from a correct race free program) disallows testing concurrent ordering.

(I view that as a win - one more thing I don't have to code...)

One good point from the talk:
Design only what you need, avoid generic designs.
Avoid writing generic lock-free code.
In other words, making a fully general lock free thingie with no fine print and no strings attached is really, really hard. After you spend more time engineering your super-data-structure than you should have, you will end up with something that has subtle concurrency bugs.

Instead, look for chances to simplify the problem in the application space.

An example from X-Plane: the APIs around our art assets are not lock-free. But the API to use them is lock-free, and that's actually all we care about.  Loading/unloading happens in the background on worker threads and is always asynchronous.  It has to be, because the actual work of loading/unloading the art assets is often non-trivial.

But once you have a reference count and pointer to an art asset, you can use it with no further checks; the reference count guarantees it isn't going anywhere. This means the rendering loop can proceed and never gets locked out by loading threads, which is what we care about.

This isn't a "fair" design, and it's not lock-free or wait-free, but it's unfair in exactly the way we need it to be and it's faster on the path that actually matters.

(It also has a subtle race condition due to an implementation error...but that's for another post.)

Saturday, December 19, 2015

The Dangers of Super Smart Compilers

For the first time today, I ran an optimized DSF render using RenderFarm (the internal tool we use to make the global scenery) compiled by Clang.
The result was a segfault, which was a little bit surprising (and very disheartening) because the non-optimized debug build worked perfectly, and the optimized build works perfectly when compiled by GCC. When -O0 revealed no bug (meaning the bug wasn’t some #if DEV code) it was time for a “what did the optimizer do this time session.”
After a lot of printf and trial and error, it became clear that the optimizer had simply skipped an entire block of code that went roughly like this:
for(vector<mesh_mash_vertex_t>::iterator pts = 
   ioBorder.vertices.begin(); pts != 
   ioBorder.vertices.end(); ++pts)
if(pts->buddy == NULL)
   /* do really important stuff */
The really important stuff was being skipped, and as it turns out, it was really important.
So…WTF? Well, buddy isn’t a pointer - it’s a smart handle, so operator== isn’t a pointer compare it’s code. We can go look at that code, let’s see what’s in it.
The handle turns out to just be a wrapper around a pointer - it’s operator* returns *m_ptr. Operator== is defined out of line and has a case specifically designed to make comparison-with-null work.
  template < class DSC, bool Const >
  bool operator==(const CC_iterator<DSC, Const> &rhs,
                  Nullptr_t CGAL_assertion_code(n))
    CGAL_assertion( n == NULL);
    return &*rhs == NULL;
Of course, Clang is way smarter than I am, and it actually has commentary about this very line of code!
Reference cannot be bound to dereferenced null pointer in well-defined C++ code; comparison may be assumed to always evaluate to false.
Oh @#. Well, there’s our problem. This operator==, like plenty of other semi-legit code, is “unpacking” the handle wrapper by using &* to get a bare pointer to the thing being wrapped. In practice, the & and * cancel each other out and you get the bare pointer that is secretly inside whatever you’re working with.
Except that Clang is sooooo clever. It goes “hrm - if &*rhs == NULL then what was *rhs? It’s a NULL reference (because rhs is NULL and we dereferenced it). And since NULL objects by reference are illegal, this must never have happened - our code is in undefined behavior land as soon as *rhs runs.
Since our code is in undefined behavior land (if and only if *rhs is a “null object” if such a thing exists, which it doesn’t) then the compiler can do whatever it wants!
If *rhs is not a NULL object, &*rhs won’t ever equal NULL, and the result is false. So if one side of the case returns false and the other side is undefined, we can just rewrite the whole function.
  template < class DSC, bool Const >
  bool operator==(const CC_iterator<DSC, Const> &rhs,
                  Nullptr_t CGAL_assertion_code(n))
    return false; /* there I fixed it! */
and that is exactly what Clang does. Thus if(pts->buddy == NULL) turns into if(false) and my important stuff never runs.
The short term “fix” (and I use the term loosely) is to do this:
for(vector<mesh_mash_vertex_t>::iterator pts = 
   ioBorder.vertices.begin(); pts != 
   ioBorder.vertices.end(); ++pts)
if(pts->buddy == CDT::Vertex_handle())
   /* do really important stuff */
Now we have operator== between two handles:
  template < class DSC, bool Const1, bool Const2 >
  bool operator!=(const CC_iterator<DSC, Const1> &rhs,
                  const CC_iterator<DSC, Const2> &lhs)
    return &*rhs != &*lhs;
This one is also doing illegal undefined stuff (&* on a null ptr = bad) but Clang can’t tell in advance that this is bad, so the optimizer doesn’t hammer our code. Instead it shortens this to a pointer compare and we win.
Newer versions of CGAL* have fixed this by taking advantage of the fact that a custom operator->() returns the bare pointer underneath the iterator, avoiding the illegal null reference case. (This technique doesn’t work in the general case, but the CGAL template is specialized for a particular iterator.)
In Clang’s defense, the execution time of the program was faster until it segfaulted!
  • You can make fun of me for not updating to the latest version of every library every time it comes out, but given the time it takes to update libraries on 3 or 4 compilers/build systems and then deal with the chain of dependencies if they don’t all work together, you’ll have to forgive me for choosing to get real work done instead.

Thursday, December 10, 2015

Source Control for Art Assets - This Must Exist

I've been thinking a lot lately about revision control for art assets. As X-Plane has grown, our art team has grown, and as the art team has grown, our strategy for dealing with art assets is coming under strain.

Currently we use GIT for source code and SVN for art assets in a single shared repo. No one likes SVN - it was selected as the least bad alternative:

  • Since it's centralized, it's much more in line with what artists expect for revision control - no explaining distributed source control to non-programmers.
  • It doesn't replicate the entire history of an art asset, which is too much data.
  • Parts of a tree can be checked out without paying for the entire tree.
  • There are decent GUIs for every platform.
  • It's scriptable for integration flexibility.
SVN still has some real problems:

  • It is just so slow. You can look at your wire speed and SVN's speed and you're just not getting a fast transfer.Update: this finding is wrong! SVN's speed at transferring binary files is about the same as your wire speed to the server. I'll write up a separate post on speed tests. Many of us are using GUI clients and it is possible that some of them are adding a tax, but the command line SVN client is similar in up/down transfer speed to GIT and rsync for basic data transfer.
  • SVN can't do an incremental update without a working repo, which means having a .svn directory even for the art assets you're not working on. That means at least 2x the disk space on the entire art asset pile, just to be able to get latest.

GIT's Not It

Since I am a programmer, my first thought was: well, clearly GIT can be made to do this, because GIT is the answer to all problems involving files. I spent some time trying to figure out how to shoe-horn GIT into this roll and have concluded that it's not a good idea. GIT simply makes too many fundamental assumptions that are right for source trees and wrong for art asset piles. We'd be fighting GIT's behavior all of the time.

We Kind of Want Rsync

There are two parts of art asset version control: letting the guys who are doing the work make revisions, and letting the people not doing the work get those revisions. It's easy to overlook that second task, but for any given person working on X-Plane, that artist is not working on most of the airplanes, scenery packs, etc.  And the programming team is working on none of them.

For the task of getting art without revision control, rsync would be just great.

  • It can work incrementally.
  • It only gets what you need.
  • It's reasonably fast.
  • It doesn't waste any disk space.
One of the main problems with SVN is performance - if I have to change a branch, having SVN take half an hour to get the new art asset pack I need is pretty painful. So it's at least interesting to look at the architecture rsync implies:

  • Files live on the server.
  • We fetch only the files we want.
  • We basically do a straight network transfer and we don't try anything to clever.
Hrm....I know another program like that.

We Kind of Want The X-Plane Installer/Updater

We solved the problem of getting the latest art assets for all of our users - it's called the X-Plane updater. In case you haven't spent your copious free time wire-sharking our updater, it's really, really simple:

  • All files live on an HTTP server, pre-compressed.
  • A manifest lives on the HTTP server.
  • The client downloads the manifests, compares what it has to what's on the server, then fetches the missing or newer files and decompresses them.
Our installer is (sadly) not content-addressed (meaning a file's name is what is inside it, which naturally removes dupes). If I could redesign it now it would be, but in my defense, GIT wasn't a round when we did the original design. (As a side note, it's way easier to debug server side problems when you are not content addressed. :-)

But we can imagine if it was. If it was, we wouldn't keep a fresh mirror of every version of X-Plane on the server - we'd just have a big pool of content-addressed files (a la GIT) and fetch the subset we need.

Let's Version Control the Manifest

So naively my thinking is that all we need to do is version control our file manifest and we have our art asset management solution.
  • Each atomic revision of a version-controlled art asset pack (at whatever granularity that is) creates a new manifest describing exactly what art assets we have.
  • Art assets are transferred from a loose file dump by syncing the manifest with the local machine.
Here's what is interesting to me: we could use pretty much any source control system and get away with it, because the manifest files are going to be relatively small.

Does This Really Not Exist

I feel like I must be missing something...does a tool like this not already exist?  Please point me in the right direction and call me an idiot in the comments section if someone has already done this!

Importance Sampling: Look Mom, No Weights

For anyone doing serious graphics works, this post will be totally "duh", but it took me a few minutes to get my head straight, so I figure it might be worth a note.

Fair and Balanced or Biased?

The idea of importance sampling is to sample a function in a biased way, where you intentionally bias your samples around where most of the information is. The result is better leverage from your sampling budget.

As an example, imagine that we want to sample a lighting function integrated over a hemisphere, and we know that that lighting function has a cosine term (e.g. it is multiplied by the dot product of the light direction and the normal.)

What this means is that the contributing values of the integration will be largest in the direction of the normal and zero at 90 degrees.

We could sample equally all around the hemisphere to learn what this function does. But every sample around the the outer rim (90 degrees off) of the hemisphere is a total waste; the sampled function is multiplied by cos(90), in other words, zero, so we get no useful information. Spending a lot of our samples on this area is a real waste. Ideally we'd sample more where we know we'll get more information back (near the normal) and less at the base of the hemisphere.

One way we can do this is to produce a sample distribution over the hemisphere with weights. The weight will be inversely proportional to the sample density. We come up with a probability density function - that is, a function that tells us how likely it is that there is information in a given location, and we put more samples where it is high, but with lower weights.  In the high probability regions, we get the sum of lots of small-weight samples, for a really good, high quality sampling. In the low probability region, we put a few high weight samples, knowing that despite the high weight, the contribution will be small.

You can implement this by using a table of sample directions and weights and walking it, and you can get just about any sampling pattern you want.  Buuuuuut...

Lighting Functions - Kill the Middle Man

With this approach we end up with something slightly silly:
  1. We sample a lighting equation at a high density region (e.g. in the middle of a specular highlight).
  2. We end up with a "strong" lighting return, e.g. a high radiance value.
  3. We multiply this by a small weight.
  4. We do this a lot.
In the meantime:
  1. We sample a lighting equation in a low density region.
  2. We end up with a very low radiance value.
  3. We multiply it by a heavy weight.
  4. We do this once.
Note that the radiance result and the weight are always inverses, because the probability density function is designed to match the lighting function. The relative weight of the brightness thus comes from the number of samples (a lot at the specular highlight, very few elsewhere).

We can simplify this by (1) throwing out the weights completely and (2) removing from our lighting equation the math terms that are exactly the same as our probability density function.  Steps 2 and 3 go away, and we can sample a simpler equation with no weighting.

Here's the key point: when you find a probability density function for some part of a lighting equation on the interwebs, the author will have already done this.

An Example

For example, if you go look up the GGX distribution equation, you'll find something like this:

GGX distribution:
float den = NdotH * NdotH * (alpha2 - 1.0f) + 1.0f;
return alpha2 / (PI * den * den);
That's the actual math for the distribution, used for analytic lights (meaning, like, the sun).  The probability density function will be something like this:
float Phi = 2 * PI * Xi.x;
float CosTheta = sqrt( (1 - Xi.y) / ( 1 + (a*a - 1) * Xi.y ) );
float SinTheta = sqrt( 1 - CosTheta * CosTheta );
(In this form, theta of 90 points at your normal vector; Xi is a 2-d variable that uniformly samples from 0,0 to 1,1. The sample at y = 0 samples in the direction of your normal.)

Note that the probability density function contains no weights. That's because the sample density resulting from running this function over a hemisphere (you input a big pile of 0,0 to 1,1 and get out phi/theta for a hemisphere) replaces the distribution function itself.

Therefore you don't need to run that GGX distribution function at all when using this sampling. You simply sample your incoming irradiance at those locations, add them up, divide by the samples and you are done.

Doing It The Silly Way

As a final note, it is totally possible to sample using a probability density function that is not related to your actual lighting equation - you'll need to have sample weights and you'll need to run your full lighting equation at every point.

Doing so is, however, woefully inefficient. While it is better than uniform sampling, it's still miles away from importance sampling with the real probability density function replacing the distribution itself. 

Saturday, November 21, 2015

Blender Notepad - Eulers

When Blender describes a rotation as an 'XYZ' Euler (with 3 angles), this is what they mean:
  • The Z axis is "up" (the Y axis is away from us to be right-handed).
  • Each rotation is around the named axis.  So X is a rotation around the X axis (a "pitching up" rotation for pilots).
  • The rotations are done in the order listed, extrinsically. In other words, we rotate around each of these global axes.
The net result of this is that the X rotation is affected by the Y and Z (because they happen later).  If we were rotating around the rotated Y (Y') or rotated Z (Z'') axis, then the X axis would be unaffected.

The net result is that (from an aviation-angles perspective) we do yaw first (global Z is unaffected), then roll (transformed Y), then pitch (transformed X).  (It should be noted that with pitch last, this does not even remotely correspond to how pilots think about these angles.)

To match X-Plane's transform, instead of XYZ, we need (in Blender) YXZ, which puts Y (roll) at lowest priority.

How the 2.49 Exporter Goes to Crazy Town

Blender 2.75 lets you select orientations; 2.49 is always in XYZ mode.  Since these are global axes, the correct order to apply them in an OBJ is:
ANIM_rotate 0 0 1
ANIM_rotate 0 1 0
ANIM_rotate 1 0 0
That is, apply Z first, since X-Plane only has local transforms.  (That is, in X-Plane, the last animation is affected by the prior two.)

When the Blender 2.49 exporter decomposes rotations into Eulers, it goes in this order, but it does so in X-Plane coordinates.  Thus while "yaw" is unchanged in XYZ animation in Blender, "roll" is unchanged in the export.

Friday, November 20, 2015

SASL Crash on El Capitan - the Gory Details

I'm trying to not clog up the X-Plane developer blog with tons of technical C++ details. There are a small number of developers who actually want to know those details, so I'm going to post them here. This post explains why SASL was crashing on plugin-unload on El Capitan (but not older operating systems).

Both SASL and Apple's OpenAL implementation are open source, so despite this being a bug that was totally not in the X-Plane code base, I was able to look at everyone involved and debug it myself. I am not particularly happy about having to do that, but the symptoms of the bug were:
  • Upgrade to El Capitan for free - why not, new things are shiny.
  • Run X-Plane - seems okay!
  • Run SASL plane - seems okay!
  • Switch from SASL plane to plane that ships with X-Plane. Oh noes - my sim crashed! Report a bug to Laminar Research.
The back-trace from the Apple crash reports were all very clear: X-Plane was unloading SASL, SASL was asking OpenAL to tear down its audio context, and OpenAL was throwing an uncaught exception.

So I got involved because users thought this was our bug, even though it wasn't.

Hrm - new crash in Apple's framework in a new OS. Blame Apple! Except, no other OpenAL code is crashing.

Apple's Bug

It turns out there is a bug in Apple's OpenAL. It's one that has been in there for a long time, but only shows up in El Capitan, and frankly doesn't matter in any real way. On OS X, if you call alcDestroyContext on a context that has (1) playing sounds and (2) is the only context for its device and (2) isn't using effects on those sounds then you get an uncaught exception on El Capitan.

The actual bug is subtle - the tear-down order of the underlying audio units that power Apple's OpenAL implementation isn't quite right in this case, resulting in AudioUnits returning an error code in a destructor.  The code throws this and catches it in the underlying alcDestroyContext call.

From what I can tell, there was a tool chain change in El Capitan that causes this to terminate an app. I am not an expert, but I think that throwing an exception out of a destructor is undefined behavior, and now Clang is putting its foot down. When I compiled OpenAL from source, my built version simply caught the exception and returned it from alcDestroyContext.

For what it's worth, I don't consider this a severe bug or engineering failure by Apple. The OpenAL specification is a total disaster, and I don't blame anyone who misses a corner case (assuming deleting a playing context even is legal - with a spec like that, who knows). And no app in its right mind would just go kill the context without stopping audio first. Which brings us to SASL's bug.

SASL's Bug

SASL had a bug too. SASL uses a stack based C++ class to change the OpenAL audio context from X-Plane's context to its own to do audio work and then turn it back when done. This is a classic RAII way to manage state.
ContextChanger changer(sound->context);
Except the clean-up code in SASL had this:
That is, of course, totally legal C++, and totally not useful. I look forward to the day when creating a temporary object in its own expression with a non-trivial destructor is a warning, because I've done this in my own code too.

Without a working context changer, SASL's cleanup code would attempt to clean up all of X-Plane's audio objects (not cool man, not cool!) and then kill its own context. Of course, its own context was still playing since no cleanup had happened.

To put it bluntly, this bug makes me pretty mad, and here's why:
  • This code has literally never worked right. Not once, not since day one.
  • The fact that this code was not working right was easily detectable just by checking the OpenAL error code. When SASL goes to delete sources in the wrong context, in most cases the source names are wrong and OpenAL returns an error code. During development and in debug mode, SASL should be checking the OpenAL error code, at least when it finishes its own work before returning control to X-Plane.
Unfortunately, before this bug was fixed, SASL contained only one bit of "error checking code":
ContextChanger(ALCcontext *context) {oldContext = alcGetCurrentContext();alcMakeContextCurrent(context);alGetError();};
If you don't speak OpenAL, basically that's SASL clearing the error code before beginning audio work, with no check of what is in there. This is not how to do error checking.

The Fix Is In

The good news is that the newest version of SASL (2.4 as of this writing) fixes the context changer bug, and also in some cases checks the OpenAL error code after issuing OpenAL commands. The error checking is not as complete as I'd like to see, and still will silence the error sometimes, but it's a step in the right direction.

Are there any teachable moments here? I think there are a few:
  • If an API provides return codes* for the purpose of determining program correctness (E.g. OpenAL returning "invalid source") it is absolutely to leverage those return codes to do debug assertion checking.
  • It is not good enough to run the code and observe expected behavior at the user level - you need to verify that the code is actually doing what you expect, or you don't know. (A very wise senior engineer once told that to me 21 (!) years ago when I was just an intern at Avid's taken me about that long to deeply understand this in my gut.)
  • Any time the behavior of code isn't going to be directly user observable (which includes pretty much all resource cleanup code), you need to design the system for debug-ability, e.g. create test cases, attach the debugger, put logging in place, put assertions in place. Proving a program is correct and debugging it is a design requirement just like functionality.

* I don't want to use the term error codes for these returns because I think it is important to distinguish between mistakes in program correctness (you, the programmer, screwed up) and expected failures of hardware (e.g. a disk read error). Having a return enumeration from a function is a coding idiom that can be used for either of these cases. In the case of OpenAL and OpenGL, the returned code detects both programmer mistakes and underlying "errors", e.g. exhaustion of memory.