Wednesday, June 01, 2011

Guessing the Fine Print

One of the great things about OpenGL is that it can draw things really fast.

One of the great things about OpenGL is that it's really flexible.

But is it fast and flexible? No. There are "fast paths"; ask the GL to do something adequately byzantine and it's going to get the job done by a correct but not particularly optimized driver path.

I have had one or two occasions to peer over the shoulder of driver writers and see what a production driver looks like. Here's a taste. (Note: this is made up for NDAs were harmed in the creation of this example.)

/* Figure out if we can push vertices through the fast path. */

#if X2913_GPU
if (
vbo->internal.struct_align % STRUCT_ALIGN_MOD == 0 &&
(vbo->source_mode == SOURCE_AGP || !vbo->resident) &&
vbo->current_day != AGP_YES_IT_IS_TUESDAY &&
vbo->internal.size > MIN_SIZE_FOR_FAST_PATH &&
vbo->internal.size < MAX_SIZE_FOR_FAST_PATH &&
IS_SIZE_WE_LIKE(vbo->internal.size) &&
/* do accelerated case *
} else
/* another 50,000 conditions have to be met. */

What we have here might qualify as a leaky abstraction (at least with respect to performance): the fast path isn't obvious from the OpenGL API, but it matters.

Well, every now and then, you get to see yourself fall off the fast path. Ouch!

This is is a screenshot of an instruments 2.x trace (with the time profiler - 1.x won't give you this kind of info) of X-Plane with a fast path failure. In this case, we do a glCopyTexSubImage2D and...bad things happen! It's taking 67% of our frame time! In this case, we can sort of guess what the driver might be doing.
  • 57% of the time goes into a gldFinish - I speculate that that's Apple asking nVidia to finish filling pixels on a surface. This of course goes through into Kernel space and spends a lot of time doing things that have "wait" in them.
  • Another 8.2% is in glgProcessPixelsWithProcessor - that sounds a lot like Apple using the host to do some kind of pixel op.
Put it together and we realize: whatever we asked for, the driver can't do it in hw, so instead the OpenGL stack is stalling until the GPU completes, reading the data back to the host, and processing it.

Driver monitor confirms this - with non-zero "time spent waiting in user code" (meaning a call that might not block is blocking) and a non-zero texture page off bytes (meaning something in VRAM had to be copied back to the host). This is not what we want out of glCopyTexImage2D. Generally we never want to copy anything off of the GPU and we never want to wait in host.

What did it turn out to be? Well, the first surprise was that we were using glCopyTexImage2D at all (and not using an FBO). It turns out that we were reading back from an RGBA16F surface into an RGBA8 texture in a misguided attempt to cope with mismatching gamma. Of course, the driver could in theory build a custom shader to make that transformation, but it's very reasonable to expect a punt. Getting the two surfaces to the same format and using an FBO fixed the problem.

EXT vs ARB - The Fine Print

I thought I had found a driver bug: my ATI card on Linux rejecting a G-Buffer with a mix of RGBA8 and RG16F surfaces. I know the rules: DX10-class cards need the same bit plane width for all MRT surfaces.

I had good reason to think driver bug: the nappy old drivers I got from Ubuntu 10.04 showed absolutely no shadows at all, weird flicker, incorrect shadow map generation - and cleaning them out and grabbing the 11-5 Catalyst drivers fixed it.

Well, when the drivers don't work, there's always another explanation: an idiot app developer who doesn't know what his own app does. In that guy's defense, maybe the app is large and complex and has distinct modules that sometimes interact in surprising ways.

In my case, the ATI drivers follow the rules: the "EXT" variant of the framebuffer extension: an incomplete format error is returned if the color attachments aren't all of the same internal type. This was relaxed in the "ARB" variant, which gives you more MRT flexibility.

What amazes me is that the driver cares! The driver actually tracks which entry point I use and changes the rules for the FBO based on how I got in. Lord knows what would happen if I mixed and matched entry points. I feel bad for the poor driver writers for having to add the complexity to their code to manage this correctness.