Tuesday, March 16, 2010

Santa to Ben: You're An Idiot

A few months ago I posted a request (to Santa) for parallel command dispatch. The idea is simple: if I am going to render several CSM shadow map levels and the scene graph contained in each does not overlap, then each one is both (1) independent in render target and (2) independent in the actual work being done. Because the stuff being rendered to each shadow map is different, using geometry shaders and multi-layer FBOs doesn't help.* My idea was: well I have 8 cores on the CPU - if the GPU could slurp down and run 8 streams, it'd be like having 8 independent GLs running at once and I'd get through my shadow prep 8x as fast.
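(To make that concrete, here's a minimal sketch of the serial loop I'd love to parallelize. The FBO array and the per-cascade draw routine are hypothetical stand-ins rather than real X-Plane code, and it assumes the EXT_framebuffer_object entry points are already loaded.)

    // Hypothetical sketch of serial CSM shadow prep: each cascade is an
    // independent render target and an independent walk of the scene graph.
    const int kNumCascades = 4;
    extern GLuint cascade_fbo[kNumCascades];        // one depth-only FBO per cascade
    extern void   draw_scene_for_cascade(int c);    // culls + submits just that cascade's geometry

    void render_shadow_maps()
    {
        for (int c = 0; c < kNumCascades; ++c)
        {
            glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, cascade_fbo[c]);
            glClear(GL_DEPTH_BUFFER_BIT);
            draw_scene_for_cascade(c);    // disjoint work - nothing shared between cascades
        }
        glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, 0);
    }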

I was talking with a CUDA developer and finally got a clue. The question at hand was whether CUDA actually runs parallel kernels. Her comment was that while you can queue multiple kernels asynchronously, the goal of that technique is to keep the GPU busy - that is, to keep the GPU from going idle between batches of kernel processing. Multiple kernels aren't needed to keep the GPU fully busy within a batch, because even with hundreds of shader units, a kernel is going to run over thousands or tens of thousands of data points. That is, CUDA is intended for wildly parallel processing, so the entire swarm of "cores" (or "shaders"?) is still smaller than the number of units of work in a batch.

If you submit a tiny batch (only 50 items to work over) there's a much bigger problem than keeping the GPU hardware busy - the overhead of talking to the GPU at all is going to be worse than the benefit of using the GPU. For small numbers of items, the CPU is a better bet - it has better locality to the rest of your program!

So I thought about that, then turned around to OpenGL and promptly went "man am I an idiot". Consider a really trivial case: we're preparing an environment map, it's small (256 x 256) and the shaders have been radically reduced in complexity because the environment map is going to be only indirectly shown to the user.

That's still at least 65,536 pixels to get worked over (assuming we don't have over-draw, which we do). Even on our insane 500-shader modern day cards, the number of shaders is still much smaller than the amount of fill we have to do. The entire card will be busy - just for a very short time. (In other words, graphics are still embarrassingly parallel.)
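(Rough arithmetic, just to put numbers on it: 65,536 pixels spread across roughly 500 shader units is still on the order of 130 pixels of fill per unit - more once you count overdraw - so even this tiny render target hands every unit plenty of work.)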

So, at least on a single-GPU card, there's really no need for parallel dispatch - serial dispatch will still keep the hardware busy.

So...parallel command dispatch? Um...never mind.

This does beg the question (which I have not been able to answer with experimentation): if I use multiple contexts to queue up multiple command queues to the GPU using multiple cores (thus "threading the driver myself"), will I get faster command-buffer fill and thus help keep the card busy? This assumes that the card is going idle when it performs trivially simple batches that require a fair amount of setup.

To be determined: is the cost of a batch in driver overhead (time spent deciding whether we need to change the card configuration) or real GPU overhead (e.g. we have to switch programs and the GPU isn't that fast at it)? It can be very hard to tell from an app standpoint where the real cost of a batch lives.
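One way to start teasing that apart - a sketch, assuming EXT_timer_query is available and its entry points are loaded; the two helper functions are made up - is to wall-clock the CPU side of a batch while bracketing the same batch with a GPU timer query:

    #include <cstdio>

    // Separate the CPU (driver) cost of a batch from its GPU execution cost.
    extern double now_seconds();        // hypothetical high-resolution wall clock
    extern void   submit_one_batch();   // the program change + draw call we want to measure

    void profile_batch()
    {
        GLuint q;
        glGenQueries(1, &q);

        double cpu_start = now_seconds();
        glBeginQuery(GL_TIME_ELAPSED_EXT, q);
        submit_one_batch();
        glEndQuery(GL_TIME_ELAPSED_EXT);
        double cpu_ms = (now_seconds() - cpu_start) * 1000.0;      // time the driver ate on our thread

        GLuint64EXT gpu_nanos = 0;
        glGetQueryObjectui64vEXT(q, GL_QUERY_RESULT, &gpu_nanos);  // blocks until the GPU finishes
        printf("CPU submit: %.3f ms, GPU execute: %.3f ms\n", cpu_ms, gpu_nanos / 1000000.0);

        glDeleteQueries(1, &q);
    }

If the CPU number dominates, the cost lives in the driver; if the GPU number dominates, the card itself is slow at the state change.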

Thanks to those who haunt the OpenGL forums for smacking me around^H^H^H^H^H^H^H^H^Hsetting me straight re: parallel dispatch.

* Geometry shaders and multi-layer FBOs help, in theory, when the batches and geometry for each rendering layer are the same. But for a cube map, if most of the scene is not visible from each cube face, then the work for each cube face is disjoint and we are simply running our scene graph, except now we're going through the slower geometry shader vertex path.
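(For reference, here's roughly what the one-pass, layered path looks like on the C++ side - a sketch assuming GL 3.2-style layered attachments via glFramebufferTexture; in the pure-extension era the call was glFramebufferTextureARB/EXT, and all the object names here are hypothetical.)

    extern GLuint cube_depth_tex;     // GL_TEXTURE_CUBE_MAP depth texture
    extern GLuint layered_fbo;
    extern GLuint cube_gs_program;    // geometry shader amplifies each triangle 6x and sets gl_Layer
    extern void   draw_entire_scene();

    void render_cube_map_layered()
    {
        glBindFramebuffer(GL_FRAMEBUFFER, layered_fbo);
        // Attaching the cube map without naming a face makes the attachment "layered":
        glFramebufferTexture(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT, cube_depth_tex, 0);
        glClear(GL_DEPTH_BUFFER_BIT);
        glUseProgram(cube_gs_program);
        draw_entire_scene();    // the catch: this is the whole scene, not a per-face cull
        glBindFramebuffer(GL_FRAMEBUFFER, 0);
    }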

Wednesday, March 10, 2010

The Value Of Granularity

OpenGL is a very leaky abstraction. It promises to draw in 3-d. And it does! But it doesn't say much about how long that drawing will take, yet performance is central to GL-based games and apps. What fills that gap is transient knowledge about OpenGL and its current dominant implementations, and it isn't easy to come by - it comes from a mix of insight from other developers, connecting the dots, reading many disparate documents, and direct experimentation. That's a lot to ask of someone who isn't working full time as an OpenGL developer, so I figure there may be some value in blogging things I have learned the hard way about OpenGL while working on X-Plane.

OpenGL presents new functionality via extensions. (It also presents new functionality via version numbers, but the extensions tend to range ahead of the version numbers, because the version number can only be bumped when all required extensions are available.) When building an OpenGL game you need a strategy for coping with different hardware with different capabilities. X-Plane dates back well over a decade and has been using OpenGL for a while, so the app has had to cope with pretty much every extension being unavailable at one point or another.
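(Here's roughly what that detection looks like at startup - a sketch, assuming a GL 2.x-style context is current and the GL headers are included; real code should match whole tokens in the extension string rather than raw substrings, and none of these variable names are X-Plane's.)

    #include <cstring>

    static bool has_extension(const char * name)
    {
        const char * all = (const char *) glGetString(GL_EXTENSIONS);
        return all != NULL && strstr(all, name) != NULL;    // sketch only - match whole tokens in real code
    }

    // Capability flags sampled once at init.
    bool has_glsl = false, has_fbo = false, has_vbo = false;

    void detect_capabilities()
    {
        has_glsl = has_extension("GL_ARB_shader_objects");
        has_fbo  = has_extension("GL_EXT_framebuffer_object");
        has_vbo  = has_extension("GL_ARB_vertex_buffer_object");
    }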

Our overall strategy is to categorize hardware into "buckets". For X-Plane 9 we have 2.5 buckets:
  • Pre-shader hardware, running on a fixed-function pipeline.
  • Modern shader-enabled hardware, using shaders whenever possible.
  • We have a few shaders that get cased off into a special bucket for the first-gen shader hardware (R300, NV25), since that hardware has some performance and capability limitations.
These buckets then get sliced up by features the user selects, but these don't complicate the buckets - we simply make sure we can shade without per-pixel lighting, for example, if the user wants a higher framerate.

So here is what has turned out to be surprising: we were basically forced to allow X-Plane to run with a very granular set of extensions for debugging purposes. An example will illustrate.

Using the buckets strategy you might say: "The shader bucket uses GLSL, FBOs, and VBOs. Any hardware in that category has all three, so don't write any code that uses GLSL but not FBOs, or GLSL but not VBOs." The idea is to save coding by reducing the combination of all possible OpenGL hardware (we have eight combos of these three extensions) to only two combinations (have them all, don't have them all).
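(Continuing the sketch from above, the bucket idea boils down to collapsing those individual flags into one decision - again, made-up names, not X-Plane's real scheme.)

    enum renderer_bucket {
        bucket_fixed_function,    // pre-shader hardware
        bucket_shaders            // GLSL + FBO + VBO all present
    };

    renderer_bucket pick_bucket()
    {
        // Either the card has the whole modern trio or we fall back entirely -
        // we deliberately never write a GLSL-but-no-FBO rendering path.
        if (has_glsl && has_fbo && has_vbo)
            return bucket_shaders;
        return bucket_fixed_function;
    }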

What we found in practice was that being able to run in a semi-useful state without FBOs but with GLSL was immensely useful for in-field debugging. This is not a configuration we'd ever want to really support or use, but at least during the time period that we started using FBOs heavily, the driver support for them was spotty on the configurations we hit in-field. Being able to tell a user to run with --no_fbos was an invaluable differential to demonstrate that a crash or corrupt screen was related specifically to FBOs and not some other part of OpenGL.
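(The debugging win comes from letting a command-line switch veto whatever the driver advertises. Continuing the same sketch - --no_fbos is the real switch mentioned above; the other flag names are invented for illustration.)

    // Per-extension kill switches for in-field debugging: a crash that goes
    // away with --no_fbos points the finger squarely at FBO support.
    void apply_overrides(int argc, char ** argv)
    {
        for (int i = 1; i < argc; ++i)
        {
            if (strcmp(argv[i], "--no_fbos") == 0)  has_fbo  = false;
            if (strcmp(argv[i], "--no_glsl") == 0)  has_glsl = false;
            if (strcmp(argv[i], "--no_vbos") == 0)  has_vbo  = false;
        }
        // Bucket selection then runs on the overridden flags, so turning off
        // FBOs automatically drops us into whatever fallback path still works.
    }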

As a result, X-Plane 9 can run with any of these "core" extensions in an optional mode: FBOs, GLSL, VBOs (!), PBOs, point sprites, occlusion queries, and threaded OpenGL. That list matches a series of driver problems we ran across pretty directly.

Maintaining a code base that supports virtually every combination is not sustainable indefinitely, and in fact we've started to "roll up" some of these extensions. For example, X-Plane 9.45 requires a threaded OpenGL driver, whereas X-Plane 9.0 would run without it. We drop support for running without a given extension when tech support calls indicate that the extension is now reliable in the field.

At this point it looks like FBOs, threaded OpenGL, and VBOs are pretty much stable. But I believe that as we move forward into newer, weirder OpenGL extensions, we will need to keep another set of extensions optional on a per-feature basis as we find out the hard way what isn't stable in-field.