Friday, September 11, 2009

libdispatch vs. the Uberqueue

In a previous post I commented on Apple's new thread pool/async calling system, Grand Central Dispatch. Apple has opened the source to libdispatch, the brains behind it.

So first of all, if libdispatch and X-Plane's thread pool got into a fight, libdispatch would thoroughly kick our asses. The X-Plane thread pool is a simple cross platform message queue and pool system - totally CS101. libdispatch is highly performance tuned, provides a wide variety of dispatch semantics, and tries a lot harder to be low overhead.

I'm feeling okay about that. The X-Plane thread pool took about two days to code, and pretty much just works the way we want. In particular, the overhead may be higher than libdispatch's, but we know what we put in our thread pool, and we know our work items are fairly big; thread pool overhead doesn't show up in a profile.

There is one trick that we can do that libdispatch doesn't do (and probably shouldn't): inspecting the work queue. Our pool lets a client "edit" the queue, going in and pulling out any queued items that it has a grudge against. The main thread pool is locked for editing, so editing has to be fast and worth it.
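
Here's roughly what that looks like - a from-scratch sketch, not our actual pool code; the WorkQueue/Task names and the "owner" field are made up for illustration:

    // Hypothetical editable work queue: each task remembers an "owner"
    // (e.g. the scenery tile it touches) so a client can yank its tasks.
    #include <deque>
    #include <functional>
    #include <mutex>

    struct Task {
        std::function<void()> work;
        const void *          owner;   // who this task belongs to
    };

    class WorkQueue {
    public:
        void push(Task t) {
            std::lock_guard<std::mutex> lock(mMutex);
            mQueue.push_back(std::move(t));
        }
        // "Edit" the queue: remove every queued (not yet running) task
        // with a matching owner. This holds the queue lock, so it has
        // to be fast and worth it.
        void edit_remove(const void * owner) {
            std::lock_guard<std::mutex> lock(mMutex);
            for (auto it = mQueue.begin(); it != mQueue.end(); ) {
                if (it->owner == owner) it = mQueue.erase(it);
                else                    ++it;
            }
        }
    private:
        std::mutex       mMutex;
        std::deque<Task> mQueue;
    };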

Why allow such a weird operation? Well, X-Plane's work items are fairly expensive. And there is one case where this is a win. When it comes time to throw out a whole hunk of scenery, we need to wait until all of its enqueued mesh-building work tasks are done. We also, coincidentally, happen to throw out scenery at the same time that we shift the coordinate system, which is something you don't want to do while meshes are being built. The whole fire drill goes something like this:
  1. Insert a "barrier" into the queue. This is a custom object that basically clogs all of the worker threads. Once the barrier is in place we know that our worker threads are idle.
  2. "Edit" the queue removing any enqueued (but not running) tasks that belong to scenery that is going to get thrown out. This is strictly an optimization, but a nice one.
  3. Wait for the barrier to fall into place. Basically we need to wait for each core to finish whatever it was doing.
  4. At this point, the scenery system is "frozen". We can throw out any old scenery, and shift the coordinate system. (In fact, our barrier lets us reuse the hijacked, sleeping worker threads, so they do the shift.)
  5. Pull the barrier out and let things resume as normal.
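
The barrier itself can be as dumb as a condition variable. Here's a minimal sketch (illustrative names, not our shipping code): one worker_wait() task is enqueued per worker thread, each worker that picks one up parks on the condition variable, and once every worker has arrived the pool is known to be idle.

    #include <condition_variable>
    #include <mutex>

    class Barrier {
    public:
        explicit Barrier(int num_workers) : mTotal(num_workers) {}

        // Run by each worker thread as its "task".
        void worker_wait() {
            std::unique_lock<std::mutex> lock(mMutex);
            if (++mArrived == mTotal)
                mCond.notify_all();   // last worker in wakes the main thread
            mCond.wait(lock, [this]{ return mReleased; });
        }
        // Main thread: block until every worker is parked on the barrier.
        void wait_until_idle() {
            std::unique_lock<std::mutex> lock(mMutex);
            mCond.wait(lock, [this]{ return mArrived == mTotal; });
        }
        // Main thread: step 5 - pull the barrier out so workers resume.
        void release() {
            std::lock_guard<std::mutex> lock(mMutex);
            mReleased = true;
            mCond.notify_all();
        }
    private:
        std::mutex              mMutex;
        std::condition_variable mCond;
        const int               mTotal;
        int                     mArrived  = 0;
        bool                    mReleased = false;
    };

The main thread enqueues one worker_wait task per worker, calls wait_until_idle(), edits the queue and does the shift, then calls release(). (Our real barrier goes a step further and hands the shift work to the parked workers instead of leaving them asleep.)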
The win of "editing" out in-flight scenery asks is that it makes management of step 4 a lot easier. When we throw out old scenery, how do we know that we don't have pending tasks based on that mesh, tasks that would be operating on deallocated memory if we throw out their scenery underneath them? There are three ways to manage this:
  1. Ref-count all of the scenery. The "tasks" that operate on the scenery add a ref-count, effectively deferring deallocation of each mesh until its task has completed. (There's a sketch of this after the list.) This sucks for one reason: memory is usually rather tight, and we might end up "ballooning" memory as new scenery is loaded while old scenery hasn't gone down to zero references and been thrown out.
  2. Simply wait for all of the tasks we care about to complete before we throw out old scenery. If the shift and load are waiting on this (see above about memory), then this sucks too - our tasks might be at the back of a very long queue.
  3. Option 3 is our "editing" trick. By first inserting the barrier and then editing out any tasks in the queue, we guarantee that by the time our barrier stops threaded operation, every task referencing the scenery we want to throw out has either been (a) run to completion or (b) edited out of the queue before it could run.
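
Option 1, for contrast, pretty much falls out of shared_ptr - again just a sketch, leaning on the hypothetical WorkQueue from above:

    #include <memory>

    struct Mesh { /* big scenery mesh */ };

    void enqueue_mesh_task(WorkQueue & q, std::shared_ptr<Mesh> mesh) {
        // The lambda capture holds a reference; the mesh cannot be freed
        // until the last queued task has run - which is exactly how memory
        // "balloons" while new scenery loads alongside the old.
        q.push({ [mesh]{ /* build the mesh's LODs, etc. */ }, mesh.get() });
    }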
Final thoughts: a lot of this voodoo is necessary only because our coordinate system 'shifts' in a big global nasty ugly operation. I have been researching a way to run on a single coordinate system... the short of it is that it looks like we'd need programmable vertex shaders as minimum hardware to do this - someday, but not just yet. Once we don't have to shift, all scenery operations could be truly async, and a shift becomes a matter of dependent async tasks:
  1. When we detect that we'd like scenery to shift, we mark the tiles we don't like as "dead" and tag each one with a completion task - the tile we'd like loaded when it is destroyed.
  2. The dead tile stops queuing async work.
  3. When the last task for the tile completes, it signals the tile, which queues itself for an async-destroy operation.
  4. When the tile destroys itself the last thing it does is queue an async load of the new tile.
What's cool about this "chaining" process is that it guarantees in-order sequential operation with respect to memory, but each dead tile and its replacement might be doing this on the fly, as quickly as they can, with minimal access to the main thread.
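
In code, the chain might look something like this - a sketch built on the hypothetical WorkQueue from earlier, with a made-up Tile type; real code would also have to handle the race where the pending count is already zero when the tile is marked dead:

    #include <atomic>

    struct Tile {
        std::atomic<int> pending{0};    // async tasks still touching this tile
        bool             dead = false;  // set when we want this tile gone
        Tile *           replacement = nullptr;
    };

    void mark_dead(Tile * t, Tile * replacement) {
        t->dead = true;                 // steps 1-2: stop queuing new work
        t->replacement = replacement;
    }

    // Called at the end of every async task on the tile.
    void task_done(WorkQueue & q, Tile * t) {
        if (t->pending.fetch_sub(1) == 1 && t->dead) {
            // Step 3: the last task just finished - queue the async destroy...
            q.push({ [&q, t]{
                Tile * next = t->replacement;
                delete t;
                // Step 4: ...which queues the async load of the new tile.
                q.push({ [next]{ /* load 'next' */ }, next });
            }, t });
        }
    }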

2 comments:

  1. Apparently (I'm not a C coder, so I might have read that wrong), libdispatch does provide a way to cancel tasks, but you have to cancel a whole dispatch source (which destroys all its non-running blocks, then waits for any running block to finish and calls the cancellation handler if any).

    In your case, that would probably mean associating a source with each piece of scenery to render, and cancelling that source as part of throwing out/destroying the scenery.
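
    Something like this, perhaps (untested - I'm going from the headers, using the plain-C _f function-pointer calls instead of blocks, and all the Scenery names are invented):

        #include <dispatch/dispatch.h>

        struct Scenery { /* meshes, etc. */ };

        static void build_mesh(void * ctx)   { /* work on one scenery's meshes */ }
        static void scenery_gone(void * ctx) { delete static_cast<Scenery *>(ctx); }

        dispatch_source_t make_scenery_source(Scenery * s) {
            dispatch_queue_t  q   = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
            dispatch_source_t src = dispatch_source_create(DISPATCH_SOURCE_TYPE_DATA_ADD, 0, 0, q);
            dispatch_set_context(src, s);
            dispatch_source_set_event_handler_f(src, build_mesh);
            // Runs after dispatch_source_cancel, once no handler is in
            // flight - a safe point to destroy the scenery.
            dispatch_source_set_cancel_handler_f(src, scenery_gone);
            dispatch_resume(src);
            return src;
        }

        // Kick work:          dispatch_source_merge_data(src, 1);
        // Throw scenery out:  dispatch_source_cancel(src); dispatch_release(src);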

  2. You can simulate this kind of thing. Invent a generation count for your data. At the start of each block, check the current generation count vs. the generation count that was current when the block was queued. If it has changed, then you abort the operation.
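
    For instance (a rough, uncompiled sketch; the names are made up):

        #include <atomic>
        #include <dispatch/dispatch.h>

        static std::atomic<unsigned long> gGeneration{0};

        struct GenWork {
            unsigned long gen;        // generation when the work was queued
            void (*fn)(void *);
            void *        arg;
        };

        static void run_if_current(void * ctx) {
            GenWork * w = static_cast<GenWork *>(ctx);
            if (w->gen == gGeneration.load())
                w->fn(w->arg);        // still current - do the work
            // else: generation bumped since queuing, so abort (skip)
            delete w;
        }

        void queue_gen_work(dispatch_queue_t q, void (*fn)(void *), void * arg) {
            dispatch_async_f(q, new GenWork{gGeneration.load(), fn, arg}, run_if_current);
        }

        // Invalidate everything queued so far:  ++gGeneration;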

    That gets you "step 2", but it doesn't let you know when each and every queue that you had sent data to has reached the stuff it can skip.

    You can use dispatch groups to know when a set of work items have completed, but that isn't the same as "ok nobody is now working on the old data, safe to remove".

    If you don't interlace things that do and don't depend on the generation count, AND there are few enough queues with work in them, that might be good enough.

    However, you can't always make that assumption.

    You could have BOTH a generation count and a refcount. Grab a reference, then check the generation count; if it matches, do the work, and drop the reference afterward. So any item currently operating on your data will have a reference, but work merely queued for the future will NOT. You can actually use a dispatch group for the reference count (dispatch_group_enter is "get reference"; when you bump the generation count, you install a handler for the group completing, and that gets to release the data... or you can block your thread).
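
    Roughly (uncompiled, names made up, using the _f variants):

        #include <atomic>
        #include <dispatch/dispatch.h>

        static std::atomic<unsigned long> gGeneration{0};
        static dispatch_group_t gRefs = dispatch_group_create();

        struct Work { unsigned long gen; void * scenery; };

        static void guarded_work(void * ctx) {
            Work * w = static_cast<Work *>(ctx);
            dispatch_group_enter(gRefs);              // "grab a reference"
            if (w->gen == gGeneration.load()) {
                /* ... operate on w->scenery ... */   // touched only if current
            }
            dispatch_group_leave(gRefs);              // "drop the reference"
            delete w;
        }

        void queue_guarded(dispatch_queue_t q, void * scenery) {
            dispatch_async_f(q, new Work{gGeneration.load(), scenery}, guarded_work);
        }

        static void free_scenery(void * scenery) { /* release the old data */ }

        void throw_out(void * old_scenery) {
            ++gGeneration;                            // future blocks will skip
            // Fire once every in-flight block has left the group...
            dispatch_group_notify_f(gRefs,
                dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                old_scenery, free_scenery);
            // ...or block: dispatch_group_wait(gRefs, DISPATCH_TIME_FOREVER);
        }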

    I'm pretty sure that will manage to let you release your data just as quickly as your existing system.

    It seems like a lot of mess, but you can bundle it up in a macro (or 3), or you could invent your own _async/_sync-like API that wraps the blocks passed in with your own block that does the appropriate refcount management and generation checking.
