Wednesday, August 11, 2010

When Is Your VBO Double Buffered?

A while ago I finally wrapped my head around this, and wrote a three part post trying to explain why you never get double-buffered behavior from a VBO unless you orphan it. This is going to be an attempt to explain the issues more succinctly and describe how to stream data through a VBO.

The Problem

The problem that OpenGL developers (myself included) crash into is a stall in the OpenGL pipeline when specifying vertex data that changes every frame. You prepare new VBOs full of mesh data (for a particle system, for example), then fire up your favorite adaptive sampling profiler and find that you're blocking in glMapBuffer or glBufferSubData.

The problem is that the GPU has a "lock" on your VBO until it finishes drawing from it, preventing you from changing the VBO's contents. You can't put the new mesh in there until the GPU is done with the old one.
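
To make the failure mode concrete, here is a minimal sketch of the streaming pattern that stalls; the buffer handle and the update function are illustrative names for this post, not code from X-Plane.

    // Respecify a VBO that the GPU may still be drawing from.  Both paths
    // below force the driver to wait for every pending draw that uses vbo.
    #include <OpenGL/gl.h>      // <GL/gl.h> on non-Mac platforms
    #include <string.h>

    void update_particles(GLuint vbo, const float * verts, GLsizeiptr bytes)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);

        // Option 1: blocks until the GPU releases its "lock" on the buffer.
        glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, verts);

        // Option 2: the map call itself is where the profiler shows the stall.
        // void * ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
        // memcpy(ptr, verts, bytes);
        // glUnmapBuffer(GL_ARRAY_BUFFER);
    }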

To understand why this happens, it helps to play "what if I had to write the GL driver myself" and look at what a total pain in the ass it would be to fix this at the driver level. In particular, even if you did the work, your GL driver might be slower in the general case because of the overhead of being clever in this one case.

Your VBO Isn't Really Double-Buffered

Sometimes VBOs are copied from system memory to VRAM to be used. We might naively think that if this were the case, then we could gain access to the original system copy to update it while the GPU uses the VRAM copy.

In practice, this would be insanely hard to implement. First, this scheme would only work when VBOs are being shadowed in VRAM (not the case a lot of the time) and when the VBO has already been copied to VRAM by the time we need to respecify its contents.

If we haven't copied the VBO to VRAM, we'd have to stop and block application code while we DMA the VBO into VRAM (assuming the DMA engine isn't busy doing something else). If DMA operations on the GPU have to be serialized into the general command queue, that means the DMA operation isn't going to happen for a while.

If that hasn't already convinced you that treating VRAM vs. main memory like a double buffer makes no sense, consider also that if the system-memory copy is released, the VRAM copy is no longer a cached shadow; it is now the only copy! We now have to mark this block as "do not purge". So we might be putting more pressure on VRAM by relying on it as a double buffer.

I won't even try to untangle the complexity that a pending glReadPixels into the VBO would add. It should be clear at this point that even if your VBO seems double buffered by VRAM, for the purpose of streaming data, it's not.

Your VBO Isn't Made Up of Regions

You might not be using all of your VBO; you might draw from one half and update the other. glBufferSubData won't figure that out. In order for it to do so, it would have to:
  • Know the range of the VBO used by all pending GPU operations. (This is in theory possible with glDrawRangeElements, but not the older glDraw calls.)
  • Track the time stamp of each individual range to see how long we have to block for.
The bookkeeping on our VBO has now grown from a single integer time stamp into a set of regions, each with its own time stamp, plus the set operations to merge them. It's not surprising that the drivers don't do this. If you have a pending operation on any part of your VBO, glBufferSubData will block.
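
As a sketch of why this bites in practice (the buffer handle, vertex format, and sizes here are made up for illustration): even if you carefully draw only from one half of the buffer and write only the other half, the driver's single time stamp still forces a wait.

    #include <OpenGL/gl.h>      // <GL/gl.h> on non-Mac platforms

    void ping_pong_within_one_vbo(GLuint vbo, const float * verts, GLsizeiptr half_bytes)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *) 0);

        // Queue a draw from the FIRST half only (3 floats per vertex).  The GPU
        // now has a pending operation on the buffer object.
        glDrawArrays(GL_POINTS, 0, (GLsizei)(half_bytes / (3 * sizeof(float))));

        // Write the SECOND half only.  The driver tracks one time stamp for the
        // whole buffer, so this still waits for the draw above to finish.
        glBufferSubData(GL_ARRAY_BUFFER, half_bytes, half_bytes, verts);
    }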

One Way To Get a Double Buffer

The one way to get a double buffer on older OpenGL implementations is to re-specify the data with glBufferData and a NULL pointer. Most drivers will recognize that in "throwing out" the contents of your buffer, you are separating the contents of the buffer for future ops from what is already in the queue for drawing. The driver can then allocate a second master block of memory and return that at the next glMapBuffer call. The driver will throw out your original block of memory later at an unspecified time once the GPU is done with it.
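
Here is a minimal sketch of that orphaning idiom, assuming a fixed-size streaming buffer (the function and variable names are placeholders):

    #include <OpenGL/gl.h>      // <GL/gl.h> on non-Mac platforms
    #include <string.h>

    void stream_particles(GLuint vbo, const float * verts, GLsizeiptr bytes)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);

        // "Orphan" the buffer: same size, NULL data.  The driver can hand us a
        // fresh block of memory while the GPU keeps drawing from the old one.
        glBufferData(GL_ARRAY_BUFFER, bytes, NULL, GL_STREAM_DRAW);

        // This map no longer has to wait for pending draws on the old block.
        void * ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
        memcpy(ptr, verts, bytes);
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }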

Alternatively, if you are on OS X or have GL 3.0 (or the equivalent extension), there are extensions that let you map and operate on a buffer with synchronization suspended, allowing you to manage sub-regions of your buffer independently.
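
For reference, here is a sketch of the GL 3.0 / ARB_map_buffer_range path (the OS X cousin is APPLE_flush_buffer_range, whose calls differ slightly). With synchronization suspended it becomes your job never to write a region the GPU is still reading; the offsets and names here are illustrative.

    #include <OpenGL/gl3.h>     // or whatever header/loader exposes GL 3.0 on your platform
    #include <string.h>

    void write_sub_region(GLuint vbo, GLintptr offset, GLsizeiptr bytes, const float * verts)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);

        void * ptr = glMapBufferRange(GL_ARRAY_BUFFER, offset, bytes,
                        GL_MAP_WRITE_BIT |
                        GL_MAP_UNSYNCHRONIZED_BIT |     // do not block on pending draws
                        GL_MAP_FLUSH_EXPLICIT_BIT);     // we flush only what we touched

        memcpy(ptr, verts, bytes);

        // The flush offset is relative to the start of the mapped range.
        glFlushMappedBufferRange(GL_ARRAY_BUFFER, 0, bytes);
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }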

Friday, August 06, 2010

Restarting the OS X Window Server for Fun and Profit

Well, it's not very profitable. Hell, it's not even that fun. But let's just say, hypothetically, that you were working on a flight simulator with an OpenGL rendering engine. And let's just say, to make this interesting, that if you crank up all of the new rendering engine options, sometimes it causes the OpenGL stack to completely lose its meatballs, and the resulting carnage renders the entire computer unusable.

(If you are having trouble imagining this, close your eyes and visualize a desktop where nothing but the mouse moves, but as you drag what were your windows, small pieces of your scene graph flicker in and out of what used to be your open windows, as if you were just showing random parts of video memory. Okay, maybe it is a little bit fun.)

Here's what you need to get your life back:
  1. Have remote ssh enabled in the sharing control panel. ssh into your machine. Odds are, the remote shell is perfectly happy, even if the desktop looks like you hired Picasso as your art lead and he was extra high that day.
  2. kill -9 <pid> on your app will bring back the desktop some of the time. That is, sometimes just killing off your app is enough to get your desktop back. Typically this is a win in the case where the driver is constantly resetting and you just can't use the UI because the reset cycle is slow.
  3. If that doesn't work, this will kill off the entire window manager (including, um, everything...the Finder, your app, X-Code, icanhazcheesburger): sudo killall -HUP WindowServer
It beats a full reboot (by some marginal amount).

A Healthy Fear of Threading

Continuing in the line of pithy quotes:
There are only two kinds of programmers: programmers with a healthy fear of threaded code and programmers who should fear threaded code.
Now I'm not saying "never thread". I'm just saying "you better be getting something good for that threading, because it's driving up your development costs."

In particular, the effective execution order of threaded code can change with every run, and there is no guarantee that you have seen every combination of execution order by running your program a finite number of times.

Thus methods of checking your code quality by running your program (perhaps many times) won't reliably detect bugs in threaded code. You may not find out until that user with one more core and a background program chewing up cycles hits an execution order that you haven't seen yet.

Instead for threaded code you have to prove logically that the execution order constraints applied (via locking, etc.) create a bounded set of execution combinations, and that each one is correct. This isn't quick or easy to do.

One way we cope with this development cost in X-Plane (where we need to use threads to fully utilize multiple cores) is to use threading design patterns with known execution limits. The most common one is a message queue, where ownership of data access flows with the message down a queue. This idiom not only guarantees serialized access to data without locks, but the implementation in C++ tends to make errors rare; if you have the message you have the pointer, and thus you have rights on the data. If you don't have the message, you have nothing to dereference.
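
To make the idiom concrete, here is a minimal sketch of such a queue in modern C++ (the class and type names are illustrative, not X-Plane's actual code, and they lean on C++11 primitives that postdate X-Plane's implementation). The point is that ownership of the payload travels with the message, so whoever doesn't hold the message has nothing to dereference.

    #include <condition_variable>
    #include <memory>
    #include <mutex>
    #include <queue>
    #include <vector>

    struct mesh_update {
        std::vector<float> vertices;    // data whose ownership moves with the message
    };

    class message_queue {
    public:
        void push(std::unique_ptr<mesh_update> msg) {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(msg));
            ready_.notify_one();
        }
        std::unique_ptr<mesh_update> pop() {        // blocks until a message arrives
            std::unique_lock<std::mutex> lock(mutex_);
            ready_.wait(lock, [this] { return !queue_.empty(); });
            std::unique_ptr<mesh_update> msg = std::move(queue_.front());
            queue_.pop();
            return msg;
        }
    private:
        std::mutex                               mutex_;
        std::condition_variable                  ready_;
        std::queue<std::unique_ptr<mesh_update>> queue_;
    };

    // Worker thread: build the mesh, then push it.  After push() the worker
    // holds no pointer to the data, so serialized access is structural.
    // Render thread: msg = queue.pop(); upload msg->vertices into the VBO.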