Monday, September 14, 2009

You May Not Be Able To Update A Texture In A Thread

It took me a while to understand why this was the case - thanks to Chris for helping me wrap my head around it. Here's the deal: you may not be able to update an existing texture from a thread without getting "white flashes". I discovered this the hard way while working on X-Plane's texture pager but only recently understood why the OpenGL specification makes this so.

In a previous post I described the technique for updating OpenGL textures on a thread by creating a new texture, flushing, and then using an atomic operation to "swap" the active texture ID for the rendering engine.

But this raises the question: why not just use glTexSubImage2D to update the texture on the fly and save all of that effort?

So first: if you can use glTexSubImage2D and you don't need mip-mapping, then maybe you can get away with it. You won't be able to guarantee which version of the texture the main rendering thread uses unless you are willing to block the main thread (which you don't want to do); it's possible that the main thread could grab an old version.

I say maybe because the GL spec allows this kind of on-the-fly update only for operations that change the contents of memory but not the layout of memory. So you are counting on the GL not to reallocate memory when you sub-texture. One would hope this is the case, but I have never tried it on Windows, and I don't think the spec is very clear about which texture operations change the layout of memory.

The GL spec also requires glFinish (or the newer fence/sync APIs in OpenGL 3.2) to guarantee that your changes take effect. In practice you really only need a flush on current drivers, but I don't know whether this will continue to be the case - the flush is enough because current drivers serialize at the command-to-GPU level, and one would almost hope that this bottleneck goes away.
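
For what it's worth, here is a minimal sketch of both options from the worker thread's side:

```cpp
// Needs an OpenGL 3.2 header or an extension loader that exposes ARB_sync.

// What the spec guarantees: block the worker thread until every queued
// command - including our texture upload - has fully executed.
void publish_update_strict()
{
    glFinish();
}

// The fence route: wait only up to the point in the command stream where our
// upload sits; we could also poll the fence instead of blocking.
void publish_update_fenced()
{
    GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    // GL_SYNC_FLUSH_COMMANDS_BIT makes sure the fence itself is submitted, so
    // we can't deadlock waiting on a command buffer that never got flushed.
    glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000ull /* 1 sec */);
    glDeleteSync(fence);
}
```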

Before we're done with glTexSubImage2D, we must note that your mip-maps can't all be updated at once unless you are using some kind of auto-mipping extension, so if you manually push mipmaps then you might have a case of inconsistent texturing due to old and new mipmaps being mixed.

What if you can't use glTexSubImage2D? (For example, X-Plane won't use glTexSubImage2D in about 16 different cases to work around driver bugs we've hit over the years.) Then you have a real problem.

The problem is that a texture object is not complete until all of its mip-map levels are complete and its texture parameters have been set. Until then, using it will give you white textures.

But you cannot atomically specify the entire texture object without blocking the main thread.

This is exactly what went wrong with my original implementation of on-the-fly texture updating: I would re-texture the largest mip level (using glTexImage2D) which promptly invalidated the texture, and then the main thread (which was not blocked) would flash white until the rest of the mip-map could be rebuilt.

It is for this reason that I suggest the atomic-operation double-buffering technique. It lets you build your mip-map levels in peace and guarantees consistent rendering, i.e. you always use either the old texture or the new one.
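
A minimal sketch of the technique, using a C++11 atomic for brevity; build_complete_texture and defer_delete_to_main_thread are stand-ins for whatever your engine actually does, not X-Plane's real code:

```cpp
#include <GL/gl.h>   // <OpenGL/gl.h> on the Mac
#include <atomic>

extern void build_complete_texture(GLuint id);       // every mip level + params
extern void defer_delete_to_main_thread(GLuint id);  // see the note below

std::atomic<GLuint> g_active_texture;   // the ID the render thread binds

// Worker thread: build a complete replacement off to the side, then publish it.
void update_texture_threaded()
{
    GLuint new_id;
    glGenTextures(1, &new_id);
    glBindTexture(GL_TEXTURE_2D, new_id);
    build_complete_texture(new_id);   // all mip levels, plus texture parameters
    glFlush();                        // or a fence - see above
    // The render thread sees either the old ID or the new one, never a
    // half-built texture.
    GLuint old_id = g_active_texture.exchange(new_id);
    defer_delete_to_main_thread(old_id);
}
```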

As a final note, my implementation will delete the old texture from a thread after the atomic swap. This is a little bit dangerous if I read section D.1.2 of the spec correctly. From what I can tell, the object should still be valid to use while it is bound (i.e. until I unbind it everywhere, we're okay), but the texture ID can be recycled quickly, so theoretically it might get updated mid-use, which would be undesirable.

If the goal is to guarantee a non-blocking rendering thread, a better design is to "send" the dead objects to the main thread via a message queue, where they can be deleted in between frames; the cost is that memory use bubbles up just a little bit more while dead textures wait their turn.
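
A minimal sketch of that hand-off; any thread-safe FIFO will do in place of the little queue class here:

```cpp
#include <GL/gl.h>   // <OpenGL/gl.h> on the Mac
#include <mutex>
#include <queue>

// A tiny thread-safe FIFO of dead texture IDs.
class dead_texture_queue {
public:
    void push(GLuint id) {
        std::lock_guard<std::mutex> lock(mutex_);
        ids_.push(id);
    }
    bool try_pop(GLuint& id) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (ids_.empty()) return false;
        id = ids_.front(); ids_.pop();
        return true;
    }
private:
    std::mutex         mutex_;
    std::queue<GLuint> ids_;
};

dead_texture_queue g_dead_textures;

// Worker thread: never delete directly - mail the dead ID to the main thread.
void defer_delete_to_main_thread(GLuint id)
{
    g_dead_textures.push(id);
}

// Main thread, between frames, when nothing can still have the old ID bound:
void reap_dead_textures()
{
    GLuint id;
    while (g_dead_textures.try_pop(id))
        glDeleteTextures(1, &id);
}
```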

Friday, September 11, 2009

libdispatch vs. the Uberqueue

In a previous post I commented on Apple's new thread pool/async calling system, Grand Central Dispatch. Apple has opened the source to libdispatch, the brains behind it.

So first of all, if libdispatch and X-Plane's thread pool got into a fight, libdispatch would thoroughly kick our asses. The X-Plane thread pool is a simple cross-platform message queue and pool system - totally CS101. libdispatch is highly performance-tuned, provides a wide variety of dispatch semantics, and tries a lot harder to be low-overhead.

I'm feeling okay about that. The X-Plane thread pool took about two days to code, and pretty much just works the way we want. In particular, its overhead may be higher than libdispatch's, but we know what we put in our thread pool, and we know our work items are fairly big; thread pool overhead doesn't show up in a profile.

There is one trick that we can do that libdispatch doesn't do (and probably shouldn't): inspecting the work queue. Our pool lets a client "edit" the queue, going in and pulling out any queued items it has a grudge against. The whole queue is locked during an edit, so editing has to be fast and worth it.

Why allow such a weird operation? Well, X-Plane's work items are fairly expensive. And there is one case where this is a win. When it comes time to throw out a whole hunk of scenery, we need to wait until all of its enqueued mesh-building work tasks are done. We also, coincidentally, happen to throw out scenery at the same time that we shift the coordinate system, which is something you don't want to do while meshes are being built. The whole fire drill goes something like this:
  1. Insert a "barrier" into the queue. This is a custom object that basically clogs all of the worker threads (there's a sketch of one after this list). Once the barrier is in place we know that our worker threads are idle.
  2. "Edit" the queue removing any enqueued (but not running) tasks that belong to scenery that is going to get thrown out. This is strictly an optimization, but a nice one.
  3. Wait for the barrier to fall into place. Basically we need to wait for each core to finish whatever it was doing.
  4. At this point, the scenery system is "frozen". We can throw out any old scenery, and shift the coordinate system. (In fact, our barrier lets us reuse the hijacked, sleeping worker threads, so they do the shift.)
  5. Pull the barrier out and let things resume as normal.
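
Here's a minimal sketch of such a barrier, written with C++11 primitives for brevity (not X-Plane's actual code); one copy of the task is enqueued per worker thread, so every thread ends up parked on the same object:

```cpp
#include <condition_variable>
#include <mutex>

class queue_barrier {
public:
    explicit queue_barrier(int worker_count)
        : workers_(worker_count), waiting_(0), released_(false) {}

    // Run as a work item by each worker thread: park here until released.
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        if (++waiting_ == workers_)
            all_parked_.notify_all();   // last worker in - barrier is in place
        release_cv_.wait(lock, [this] { return released_; });
    }

    // Main thread: block until every worker is parked on the barrier.
    void wait_until_in_place() {
        std::unique_lock<std::mutex> lock(mutex_);
        all_parked_.wait(lock, [this] { return waiting_ == workers_; });
    }

    // Main thread: pull the barrier out, letting the workers resume.
    void release() {
        { std::lock_guard<std::mutex> lock(mutex_); released_ = true; }
        release_cv_.notify_all();
    }

private:
    std::mutex              mutex_;
    std::condition_variable all_parked_, release_cv_;
    const int               workers_;
    int                     waiting_;
    bool                    released_;
};
```

To reuse the parked workers for the shift, as step 4 describes, you would hand them extra work inside run() before the wait.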
The win of "editing" out in-flight scenery tasks is that it makes management of step 4 a lot easier. When we throw out old scenery, how do we know that we don't have pending tasks based on that mesh - tasks that would be operating on deallocated memory if we threw out their scenery underneath them? There are three ways to manage this:
  1. Ref-count all of the scenery. The "tasks" that operate on the scenery add a ref-count, effectively deferring deallocation of each mesh until its tasks have completed. This sucks for one reason: memory is usually rather tight, and we might end up "ballooning" memory as new scenery is loaded while old scenery hasn't gone down to zero references and been thrown out.
  2. Simply wait for all of the tasks we care about to complete before we throw out old scenery. If the shift and load is waiting on this (see above about memory) then this sucks too - our tasks might be at the back of a very long queue.
  3. Option 3 is our "editing" trick, sketched below. By first inserting the barrier and then editing out any tasks in the queue, we guarantee that by the time our barrier stops threaded operation, every task referencing the scenery we want to throw out has either (a) run to completion or (b) been edited out of the queue before it could run.
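
A minimal sketch of the edit itself; task and scenery_tile are illustrative stand-ins for the real types:

```cpp
#include <deque>
#include <mutex>

struct scenery_tile;
struct task {
    virtual scenery_tile* owner() const = 0;
    virtual ~task() {}
};

// Pull every pending (not yet running) task for a dead tile out of the queue.
// The whole queue is locked while we edit, so this has to be quick.
void edit_out_dead_tasks(std::deque<task*>& queue,
                         std::mutex&        queue_lock,
                         scenery_tile*      dead_tile)
{
    std::lock_guard<std::mutex> lock(queue_lock);
    for (auto it = queue.begin(); it != queue.end(); )
    {
        if ((*it)->owner() == dead_tile)
            it = queue.erase(it);   // never ran, never will
        else
            ++it;
    }
}
```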
Final thoughts: a lot of this voodoo is necessary only because our coordinate system 'shifts' in a big global nasty ugly operation. I have been researching a way to run on a single coordinate system...the short of it is that it looks like we'd need programmable vertex shaders as minimum hardware to do this - someday, but not just yet. Once we don't have to shift, all scenery operations could be truly async, and a shift becomes a matter of dependent async tasks:
  1. When we detect that we'd like scenery to shift, we mark the tiles we don't like as "dead" and tag each one with a completion task - the tile we'd like loaded when it is destroyed.
  2. The dead tile stops queuing async work.
  3. When the last task for the tile completes, it signals the tile, which queues itself for an async-destroy operation.
  4. When the tile destroys itself the last thing it does is queue an async load of the new tile.
What's cool about this "chaining" process is that it guarantees in-order sequential operation with respect to memory, but each dead tile and replacement tile might be doing this on the fly as quick as they can, with minimal access to the main thread.
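
A minimal sketch of the chain; queue_async and the tile members are stand-ins for whatever the real engine would use:

```cpp
#include <atomic>
#include <functional>

extern void queue_async(std::function<void()> work);   // hypothetical pool entry

struct tile {
    std::atomic<int> pending_tasks;   // outstanding async work on this tile
    tile*            replacement;     // what to load once we are gone

    // Every async task calls this as its last act.
    void task_completed() {
        if (--pending_tasks == 0) {   // last task out signals the tile
            tile* self = this;
            queue_async([self] { self->async_destroy(); });
        }
    }

    void async_destroy() {
        tile* next = replacement;
        // ... free meshes, textures, VBOs ...
        delete this;
        if (next)
            queue_async([next] { next->async_load(); });   // chained load
    }

    void async_load();   // builds the new tile, queuing its own async work
};
```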

Saturday, September 05, 2009

There Is No Substitute

Since I've already become quite old and crotchety I am going to start trying to say quotable things. If someone else already said this, I apologize - my google-fu might be weak. Anyway, here goes:
There's no substitute for understanding what your code actually does.
Am I a consultant yet?

Friday, September 04, 2009

Grand Central Dispatch

Apple has this new API in Snow Leopard called "Grand Central Dispatch". Basically it's a bunch of message queue facilities - the idea is that with concurrent message queues it will be easier for people to write scalable code on multicore machines; if the OS libs use the same API and queues, then there won't be thrash between library worker threads and app worker threads. (OS X does create a fair number of worker threads on your behalf - you can see them in the debugger in 10.5.)

So first of all: told you so.

Now that I got that out of my system: it was interesting to read Apple's documentation on how queues (their name for message queues whose messages are "do this") solve threading problems. The docs spend a lot of time describing how much you won't have to lock, but I don't think this is entirely true. The rule is pretty simple:
Resources that are owned entirely and exclusively by the "task" that has been dispatched to the queue do not need to be locked, because a task cannot be run in two places at once.
And that alone is a pretty nice improvement! (I wish I had a nickel for every forum thread where people suggest two threads and a mutex as a way to do concurrent processing. The over-use of a mutex as something it is not is probably its own blog topic.) At least if we can separate our data object to be processed from the world, now it is lock free.

But what do we do if that processing task needs to talk to the rest of our application? There are basically only a few options:
  1. Make the APIs that we have to call non-blocking and lock-free.
  2. Put fine-grained locking into the various APIs (e.g. use internal locks as needed). Now there is a risk that our tasks can block, but perhaps it won't be too bad. Use a performance analyzer to see if this is hosing us.
  3. Put some kind of coarse-grained lock on the entity the task will need - perhaps for the entire execution of the task, perhaps for longer. At this point you have to wonder if you're losing concurrency.
X-Plane uses a mix of all three strategies. Basically any API that is called from the main thread (which runs the rendering loop) gets optimized using technique 1 for minimum latency. The rest of the APIs can use techniques 2 or 3, whichever is easier.

An example of strategy 1: by using atomics to swap textures into their container objects, we don't need to obtain a lock on a texture object to use it.

Strategies 1 and 2 combine: a central registry of objects is used only by the async loader code. Once we have an object, we are lock free. (Objects are reference counted, and reference counting can be lock-free - see also the libstdc++ string implementation.)
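
As an aside, here's a minimal sketch of a lock-free reference count, written with C++11 atomics:

```cpp
#include <atomic>

struct ref_counted {
    std::atomic<int> refs;

    ref_counted() : refs(1) {}
    virtual ~ref_counted() {}

    void retain()  { refs.fetch_add(1, std::memory_order_relaxed); }

    void release() {
        // acq_rel ordering makes every write to the object visible to
        // whichever thread ends up running the destructor.
        if (refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete this;
    }
};
```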

A final comment: Apple suggests replacing fine-grained resource locking (e.g. grab a lock, do a few things, release the lock) with queueing a "block" (a small piece of code) on a serial queue (a message queue with only one thread). This can be done synchronously or asynchronously.
  • It's refreshing to see someone abusing threading primitives by suggesting that a semaphore be used instead of a mutex - normally on the net I see people doing just the opposite (and wondering why their design has 5000 edge cases and blows up as soon as they add a third thread).
  • If the block is queued asynchronously this is a performance win. Except for the part where you don't know if your work has actually been done. It's a great refactoring if you can get away with it but I wouldn't expect to hit this case any time soon.
  • If the block is queued synchronously this solution is exactly the same concurrency wise as a mutex.
Except...I'm not 100% sure of that last statement. If I take a lock on my own thread, there is no thread switch if the lock is not contested, and I might not even make a kernel switch (if the mutex is a critical section or another light-weight lock that can spin a few times before trapping).

By comparison, my understanding is that a message queue is still a message queue - I would normally expect a synchronous dispatch to a serial queue to have to "toss" the work to another thread, then wait (because if the main thread did the work itself, the worker thread would be free to pick up work tasks out of order). But perhaps Apple has optimized this, with a fast path to execute synchronously queued tasks on the calling thread when a serial queue is "idling".
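
For concreteness, a minimal sketch of the serial-queue-as-lock pattern, using the function-pointer flavor of the dispatch API so that it compiles as plain C++ (the queue label and the deposit example are made up):

```cpp
#include <dispatch/dispatch.h>

// One serial queue stands guard over one resource, playing the role of a mutex.
static dispatch_queue_t s_balance_queue =
    dispatch_queue_create("com.example.balance", NULL);   // NULL attr = serial

static void do_deposit(void* ctx)
{
    int* balance = static_cast<int*>(ctx);
    *balance += 100;   // only ever runs on the serial queue - no lock needed
}

void deposit_blocking(int* balance)
{
    // Same concurrency semantics as taking a mutex: we wait until it's done.
    dispatch_sync_f(s_balance_queue, balance, do_deposit);
}

void deposit_fire_and_forget(int* balance)
{
    // The performance win - but we don't know when the work actually happens.
    dispatch_async_f(s_balance_queue, balance, do_deposit);
}
```

For what it's worth, Apple's documentation for dispatch_sync does say it invokes the work on the current thread when possible, as an optimization - which would answer the question above.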

Thursday, September 03, 2009

OpenGL Matrix Tricks: The Fastest Culling Path?

In my last post I suggested that a clever way to cull would be:
  • Translate the view clip planes back into eye space.
  • For each cull, translate the sphere center into eye space, then test against the planes.
This idea is not my own - I first saw the idea of using plane tests against spheres in SSG - and I do not know who first thought of it. I can certainly say it's a great way to do culling when your camera orientation can be arbitrary.

But we have to ask: if we can translate our clip planes back to eye space, can we translate them back to model-view space and save some time?

The short answer is: I don't know - to really tell, you'd have to run statistical analysis on real data; culling has a lot of early-exit cases, so you need to know how often each early exit is taken to know the average "cost" of all operations.

To compare the computational cost, note that if we know we are using a frustum, we can make a few optimizations to the cull. (X-Plane does use these optimizations.)
  • We can drop the "D" constant on our left, right, top and bottom planes because we know they pass through the origin, saving an add on each plane check.
  • Because we know that the near and far clip planes are parallel to the XY plane, we can simply do a single add and compare instead of a plane equation.
  • The cost of translating from model-view to eye space (3 adds and 3 multiplies per coordinate) is not paid in full every time - we can transform only the Z coordinate, do a fast test against the near and far clip planes, and then only calculate the eye-space X and Y if we really need them.
If we had clip planes in model-view space:
  • We can skip 3 multiplies and 3 adds per coordinate as we translate from model-view to eye space. We never need these.
  • We have to do a full four-part plane equation (3 multiplies, 3 adds, plus the add and compare for the sphere volume) for all six planes, since their orientation is now arbitrary.
Adding it all up, they are very similar in the worst case: model-view clip planes save nine adds and nine multiplies (the transform) but pick up six multiplies and ten adds due to more expensive plane equations.

But the early exit looks better to me in the SSG (eye-space) case: 3 multiplies, 5 adds, and two compares to get through both the near and far clip planes vs. 6 multiplies, 8 adds, and two compares for the model-view case.

There is one other possible issue with using model-view clip planes: you need a combined model-view + projection matrix to find the clip planes. Depending on how often your model-view matrix is changed and how it is calculated, this might mean calculating a combined matrix that you would not normally have to calculate.

(Whether this is really the case is going to depend a lot on how your program shadows OpenGL state, handles transforms, and what your shaders look like, as well as the ratio of culls to view transforms. For what it's worth, if you are transforming the camera a lot relative to culls, you may have other, non-culling related performance bottlenecks.)
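
To make the comparison concrete, here is a sketch (not X-Plane's actual code) of the eye-space cull with the early exits described above; it assumes a column-major model-view matrix, as glGetFloatv returns it, and unit-length plane normals:

```cpp
struct plane3 { float a, b, c; };   // Ax + By + Cz = 0: passes through the origin

// side_planes: left, right, top, bottom in eye space, normals pointing inward.
// near_z/far_z: eye-space Z of the near and far planes (negative in OpenGL).
bool sphere_is_visible(const float mv[16], const plane3 side_planes[4],
                       float near_z, float far_z,
                       float cx, float cy, float cz, float radius)
{
    // Transform only Z first - near and far are parallel to the XY plane, so
    // one add and compare per plane is enough.
    float ez = mv[2]*cx + mv[6]*cy + mv[10]*cz + mv[14];
    if (ez - radius > near_z) return false;   // entirely on our side of the near plane
    if (ez + radius < far_z)  return false;   // entirely beyond the far plane

    // Only now pay for X and Y.
    float ex = mv[0]*cx + mv[4]*cy + mv[8]*cz + mv[12];
    float ey = mv[1]*cx + mv[5]*cy + mv[9]*cz + mv[13];

    for (int i = 0; i < 4; ++i)
        if (side_planes[i].a*ex + side_planes[i].b*ey + side_planes[i].c*ez < -radius)
            return false;   // fully outside one side plane

    return true;   // visible, or at least not trivially culled
}
```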

Wednesday, September 02, 2009

OpenGL Matrix Tricks: Camera Properties

In my last post I pointed out that:
  • Planes have vectors "embedded" in them (the normal vector).
  • We can quite cheaply backward-transform a plane, because planes are transformed by a transposed inverted matrix - if we want to go "backward" the two matrix inversions cancel each other out.
Knowing this, we can play some fun camera tricks using OpenGL's model-view and projection matrices.

Finding The Clip Planes

To cull a sphere, we can transform the sphere's center into eye space, then test its distance to each of the 6 viewing clip planes (in eye space). If we are outside any clip plane by more than the sphere radius, the sphere is out of camera view.

(We don't want to do this in clip space because clip-space's scale is non-uniform - comparing a distance to the radius of the sphere would be very wrong.)

But where do we get our culling planes? Well, the clip planes in clip space are very simple, because clip space is just a 2-unit cube centered on the origin. The clip planes are therefore:
  • Left clip: (1,0,0,1) - that is 1x + 0y + 0z + 1 = 0, also known as X + 1 = 0.
  • Right clip: (-1,0,0,1)
  • Bottom clip: (0,1,0,1)
etc. etc.

So all we need to do is multiply these planes by the transposed projection matrix and we have our clip planes not in clip space but in eye space. In practice, we can "multiply out" the transpose and the multiply and get a code snippet that looks something like this (a sketch, assuming the column-major layout glGetFloatv hands back; each plane is the matrix's fourth row plus or minus one of the others):
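
```cpp
struct plane4 { float a, b, c, d; };   // Ax + By + Cz + D = 0

// p: the projection matrix, column-major, so row r of p is
// (p[r], p[r+4], p[r+8], p[r+12]).
void extract_eye_space_clip_planes(const float p[16], plane4 out[6])
{
    out[0] = { p[3]+p[0], p[7]+p[4], p[11]+p[8],  p[15]+p[12] };  // left
    out[1] = { p[3]-p[0], p[7]-p[4], p[11]-p[8],  p[15]-p[12] };  // right
    out[2] = { p[3]+p[1], p[7]+p[5], p[11]+p[9],  p[15]+p[13] };  // bottom
    out[3] = { p[3]-p[1], p[7]-p[5], p[11]-p[9],  p[15]-p[13] };  // top
    out[4] = { p[3]+p[2], p[7]+p[6], p[11]+p[10], p[15]+p[14] };  // near
    out[5] = { p[3]-p[2], p[7]-p[6], p[11]-p[10], p[15]-p[14] };  // far
}
```

(Divide each plane through by the length of its ABC vector if you want true distances for the sphere test.)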

Basically the cost of finding our clip planes in eye space from the projection matrix is just a bunch of adds and subtracts. Cheap! I should also note that this works for any projection matrix - ortho, frustum, oblique frustum, or Dr. Demento's totally insane deformed matrix.

Finding Billboard Vectors

A billboard is a quad whose points are entirely within the XY plane in eye space. That is, it always faces the camera. To calculate a billboard, we need to find vectors in model-view space that map to "right" and "up" in eye space.

It turns out that this is a lot like the clip-plane problem above. Consider the YZ plane, facing "right" (that is, a plane whose plane coefficients are 1,0,0,0). We can transform that from eye to model-view space, then recover our normal vector and that's which way "right" is.

Once again we're taking a few static plane equations, running them through a transposed matrix, and out come our answers. Here's a sketch of the snippet (column-major again):
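
```cpp
// m: the model-view matrix from glGetFloatv(GL_MODELVIEW_MATRIX, m).
// "Right" in eye space is the matrix's first row, "up" its second.
void billboard_vectors(const float m[16], float right[3], float up[3])
{
    right[0] = m[0];  right[1] = m[4];  right[2] = m[8];
    up[0]    = m[1];  up[1]    = m[5];  up[2]    = m[9];
}
```

(If the model-view matrix contains scaling, normalize both vectors before use.)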

Another Way To Look At Billboarding

It turns out there's an even easier way to understand how the billboarding example works than transforming plane equations by a transposed matrix.

An affine transformation matrix consists of rows of basis vectors - that is, each row of the matrix is the destination X axis (then Y, then Z) in the source coordinate system.

So if we want to know which way "right" is in eye space (but we want our numbers in model-view space) we just take the top row of the 3x3 upper-left part of the model-view matrix.

Where Is My Camera Location?

Sadly you can't get the camera location in model-view space out of the model-view matrix. Our previous trick of reverse-transforming a vector or plane fails because the origin is a point - it needs to be "forward" transformed, which from eye space to model-view space means having a real inverted model-view matrix. (Once you have that, the right-most column is the origin.)

The problem is that the origin, as embedded in the model-view matrix (in the right-most column) is the origin after rotations. This is generally not what we want.

(I suppose that, theoretically, we could invert the 3x3 upper left of the model-view matrix, then apply it to the negated right-most column to find our origin, and I suppose this would be faster than calculating a full inverse matrix. I haven't tried this though - in X-Plane we simply invert the model-view matrix as needed.)
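
For what it's worth, a sketch of that cheaper route, valid only when the upper 3x3 is a pure rotation (so that its inverse is its transpose):

```cpp
// m: the model-view matrix, column-major. The camera sits where the eye-space
// origin lands, i.e. at -transpose(R) * T, with T the right-most column.
void camera_position(const float m[16], float cam[3])
{
    const float tx = m[12], ty = m[13], tz = m[14];   // right-most column
    cam[0] = -(m[0]*tx + m[1]*ty + m[2]*tz);
    cam[1] = -(m[4]*tx + m[5]*ty + m[6]*tz);
    cam[2] = -(m[8]*tx + m[9]*ty + m[10]*tz);
}
```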

OpenGL Matrix Tricks: Transforms

One of the cool things about OpenGL is that, after doing a ton of advanced math to work out your matrices, usually the end results boil down to an astoundingly small amount of C++.

Transforming Vectors and Normals

A vector can be defined via a single point: the vector runs from the origin to that point. Really we need 3 coordinates. If we have a fourth coordinate, what do we store there?

My short answer is: why do you have a fourth coordinate? Don't waste space - most GPUs are faster when operating on fewer coordinates, both in shaders and in bus bandwidth. But if you had to pick something, zero is probably a good choice, as we'll see below.

Now you may have read that you can't just transform your vectors the way you transform your points. The diagrams here explain why.

I have to point out that to a math professor, that "stretched" vector is exactly the expected result from applying a non-uniform scaling matrix to a vector.

The true problem is that we want the normal vector to stay normal - in other words, we really want to transform the plane that the normal vector is normal to, and then recover the normal vector. This is pretty different from actually transforming the vector itself, but you gotta know what you want.

See below for plane transformations, but plane transformations are expensive, because they require calculating the inverse of our transform matrix.

If we know that we don't have any scaling on our matrices (uniform scaling would change the magnitude of the vector, which can be undone, I suppose), then we can transform the normal vector directly with our transform matrix. The trick is to use only the 3x3 upper-left corner (or simply set the fourth coordinate to zero). This removes the translation part of our matrix and leaves only the rotations. If we translated the vector, its meaning relative to the origin would be lost. See here for more details.
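
A minimal sketch of that fast path (column-major storage, as in the earlier posts):

```cpp
// Rotate a normal with just the upper-left 3x3: the translation column never
// touches a direction. Only valid if the matrix has no scale or shear.
void transform_normal(const float m[16], const float n[3], float out[3])
{
    out[0] = m[0]*n[0] + m[4]*n[1] + m[8]*n[2];
    out[1] = m[1]*n[0] + m[5]*n[1] + m[9]*n[2];
    out[2] = m[2]*n[0] + m[6]*n[1] + m[10]*n[2];
}
```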

Important: you might think that you can transform your vectors by transforming the vector and the origin, and then subtracting. This is a very bad idea! The problem is that if the origin of your coordinate system is not close to zero, you'll lose a lot of precision on your normal vectors by moving away from the origin. I have had this implementation in X-Plane and the loss of precision quickly turns into visible artifacts.

Transforming Planes

A plane is represented by four values (typically ABCD, as in Ax + By + Cz + Dw = 0), and notation-wise, you would think of a plane as a column. (Since a point is a row, we are saying that point dot plane = 0.) ABC is the normal vector to the plane, and D is a scalar that defines which of the infinite number of parallel planes we have.

To transform a plane, you need to transform it by the transpose of the inverse of our matrix. See here for a derivation.

It should be noted that when you want to reverse-transform a vector (e.g. go from eye coordinates to world coordinates) this works in our favor.

When transforming a plane backward: transpose(inverse(inverse(MV))) is just transpose(MV) - in other words, the inverse of our "forward transform" (e.g. the model-view matrix) cancels out the inverse needed to transform a plane equation. So we can simply turn our vector into a plane with that vector as its normal, quickly transform it, and then pull out our vector.
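
A minimal sketch; because planes ride along as columns, each output component is just one column of the forward matrix dotted with the plane:

```cpp
// pe: a plane in eye space; pm: the same plane back in model-view space.
// m is the forward (model-view) matrix, column-major, so column i is the
// four consecutive floats starting at m[4*i].
void plane_eye_to_model(const float m[16], const float pe[4], float pm[4])
{
    for (int i = 0; i < 4; ++i)
        pm[i] = m[4*i+0]*pe[0] + m[4*i+1]*pe[1] + m[4*i+2]*pe[2] + m[4*i+3]*pe[3];
}
```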

More Reading

A good explanation of homogeneous coordinates.

Next time: Camera Tricks