Thursday, April 24, 2008

The Über-Queue

In redesigning the background processing code for X-Plane, I am revisiting message queues.

Requirements on Queue Design
  • Anyone can be a worker thread. To be a worker thread, you simply pull off a task and execute it. (But see barriers below.) The important thing here is that we can have more than one worker thread for one giant queue, so that dispatch models CPU cores rather than categories of work.
  • Anyone can queue work - no requirement to be the main thread (just about any message queue design does this).
Problems Left for Client Code
  • Client code must not delete objects that are required by pending work tasks (scheduled or running). Doing so would be, um, bad.
  • Client code cannot halt running work tasks - doing this would introduce a huge amount of complexity for very little value.
  • Client code must use its own mechanism for getting "receipts" from finished tasks.
To this last point, typically while we want to schedule everybody in one big queue and have all hardware work on tasks, we want to dispatch our results to disparate locations in the app in a very client-code-specific way. In our case, we'll typically put confirmation code in the specific task objects, such as sending a message to our parent subsystem saying we're finished.

Queue Editing

While it's not strictly necessary, we provide an atomic editing operation on our message queue - basically a client can go in and reorder pending tasks. There are a few important uses for this:
  • It allows for prioritized queueing - I can queue a work task and then edit the queue to put my task first.
  • I can find "stale" work objects and remove them (replacing them with no-ops). Normally if I need to delete an object that is referenced by a pending task, I need to wait for the task (and have some way of knowing that the task has completed, specific to the task). But if I can clean the queue first I can remove unnecessary scheduled tasks and prevent work on soon-to-be-dead objects.
  • Being able to place a task in a specific place is useful for barriers.

Barriers are actually implemented as clients on top of the scheduling system - their only requirement is to know the exact number of worker threads. (This does slightly reduce the generality of the worker thread pool.)

A barrier is implemented as a pair of message queues (or even semaphores, since all we care about is counting) and one work task for each thread.

We use the barrier to halt thread processing by inserting one work task per worker into the queue; each task signals back to us and then holds up its worker. We thus wait for one receipt per worker - four receipts for four workers - indicating that all workers are now waiting on us. At this point, we are running and background processing is halted.

To resume processing, we send messages back to the worker threads and wait again. They acknowledge and are then finished. They send an acknowledge back to us because we may want to deallocate the resources used in the barrier and cannot do so until we know that all worker threads have woken up and finished their code.

We can use barriers when we need a true halt of processing; we can also use them as markers in the stream to indicate that we are done with a range of tasks.

If we want to wait for only one task to be done, we're better off having it signal us directly. The power of barriers is that, because they halt all processing, we can be sure that every task queued before the barrier has finished - and, just as important, that the tasks queued behind the barrier are held back.
