In a world before manual synchronization of VBOs (via the OpenGL 3.0 buffer management APIs or Apple's buffer range extension), we can now see why updating a streamed VBO with glBufferSubData might perform quite badly.
The naive code goes something like this:
- Fill half the buffer with glBufferSubData.
- Issue a draw call to that half of the buffer.
- Flip which half of the buffer we are using and go back to the first step.
This implementation is going to perform terribly. The sub-data call is going to block until the previous draw call has completed, even though the two use opposite halves of the buffer, and we'll lose all of our concurrency. Let's see if we can understand why.
If we go to respecify a VBO in AGP memory using glBufferSubData while a draw from that VBO is still in progress, glBufferSubData must block; it can't rewrite the buffer until the last draw finishes, because otherwise the draw would see the new vertices, not the old, or maybe half and half. For the "fill" to go ahead without blocking, the driver would have to be able to determine that the pending draws and the new fill are completely disjoint.
There are two reasons why the driver might not be able to figure this out:
- You've drawn using glDrawElements, and thus the actual part of the vertex VBO you draw from is determined by the index table. The cost of figuring out the "extent" of this draw is to process all of the indices. The cure is worse than the disease. Any sane driver is going to simply assume that any part of the VBO could be used.
- Let's assume you use glDrawRangeElements to tell the driver that you're really only going to use half the VBO. Even then, the structure to mark "locked" regions would be a complex one - a series of draws over overlapping regions would require a complex data structure. For this one special case, you're asking the drivers to replace a simple time-stamp based lock (e.g. this VBO is locked until this many commands have executed) with a dynamic range marking structure. If I were a driver writer I'd say "let's keep it simple and not eat this cost on all VBOs."
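To make the cost concrete, here is a toy model of the naive loop running against a time-stamp lock. This is not driver code; the ToyDriver struct and the toy_* functions are names I made up for illustration, and the model assumes the GPU is always still working on the previous draw when the next fill arrives (which is the streaming case we care about).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model, not real driver code: the whole VBO carries one time stamp. */
typedef struct {
    unsigned long submitted;    /* commands the app has queued so far */
    unsigned long completed;    /* commands the GPU has finished */
    unsigned long locked_until; /* VBO is busy until 'completed' reaches this */
} ToyDriver;

/* Queue a draw that reads from the VBO. Which half it reads is not recorded. */
void toy_draw(ToyDriver *d)
{
    d->submitted++;
    d->locked_until = d->submitted;
}

/* A sub-data style fill. Returns true if the caller had to stall.
   'half' is ignored on purpose: this lock has no notion of ranges. */
bool toy_fill(ToyDriver *d, int half)
{
    (void)half;
    if (d->completed >= d->locked_until)
        return false;               /* nothing pending, fill right away */
    d->completed = d->locked_until; /* block until pending draws retire */
    return true;
}

/* Run the naive two-halves loop and count how many fills stalled. */
int toy_count_stalls(int frames)
{
    assert(frames >= 0);
    ToyDriver d = {0, 0, 0};
    int stalls = 0;
    int half = 0;
    for (int i = 0; i < frames; i++) {
        if (toy_fill(&d, half))
            stalls++;
        toy_draw(&d);
        half = 1 - half; /* flip halves; it does not help */
    }
    return stalls;
}
```

In this model every fill after the first one stalls, even though each fill targets the half that the pending draw is not reading. Flipping halves buys nothing when the lock covers the whole buffer.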
Can we do anything about this? Besides falling back to an "orphaned" approach where we get a fresh buffer each time, our alternative is to use the more exact APIs from ARB_map_buffer_range or APPLE_flush_buffer_range. With these APIs we can map only the part of the VBO we know is not in use, with the unsynchronized bit (GL_MAP_UNSYNCHRONIZED_BIT in the ARB version) set so that the map doesn't block just because the other half is pending. We can then use the flush-explicit bit (GL_MAP_FLUSH_EXPLICIT_BIT) to flush only the areas we modified. (With the 3.0 APIs we can also use the discard range option, GL_MAP_INVALIDATE_RANGE_BIT, to simply say "we are rewriting what we map".)
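As a rough sketch of what that looks like with the ARB/3.0 flavor of the API, here is a small helper that works out the arguments for mapping one half of the VBO. MapArgs and map_args_for_half are hypothetical names of mine, and I'm assuming the buffer is split into two equal halves. The access-bit values are copied from glext.h only so the snippet compiles without GL headers; real code would include the headers instead.

```c
#include <assert.h>
#include <stddef.h>

/* Access-bit values as published in glext.h, repeated here only so the
   sketch compiles without GL headers. */
#ifndef GL_MAP_WRITE_BIT
#define GL_MAP_WRITE_BIT            0x0002
#define GL_MAP_INVALIDATE_RANGE_BIT 0x0004
#define GL_MAP_FLUSH_EXPLICIT_BIT   0x0010
#define GL_MAP_UNSYNCHRONIZED_BIT   0x0020
#endif

/* Hypothetical helper type: the arguments for one glMapBufferRange call. */
typedef struct {
    ptrdiff_t offset; /* byte offset of the chosen half within the VBO */
    ptrdiff_t length; /* number of bytes to map */
    unsigned  access; /* access flags (a GLbitfield in real code) */
} MapArgs;

/* Assumes the VBO is split into two equal halves; 'half' is 0 or 1.
   'discard' adds the "we are rewriting what we map" hint. */
MapArgs map_args_for_half(int half, ptrdiff_t vbo_bytes, int discard)
{
    assert(half == 0 || half == 1);
    assert(vbo_bytes > 0 && vbo_bytes % 2 == 0);

    MapArgs a;
    a.length = vbo_bytes / 2;
    a.offset = half * a.length;
    a.access = GL_MAP_WRITE_BIT            /* we only write */
             | GL_MAP_UNSYNCHRONIZED_BIT   /* don't wait on pending draws */
             | GL_MAP_FLUSH_EXPLICIT_BIT;  /* we say which bytes changed */
    if (discard)
        a.access |= GL_MAP_INVALIDATE_RANGE_BIT;
    return a;
}

/* Intended use with the VBO bound to GL_ARRAY_BUFFER (needs a live context):
 *
 *   MapArgs a = map_args_for_half(half, vbo_bytes, 1);
 *   void *dst = glMapBufferRange(GL_ARRAY_BUFFER, a.offset, a.length, a.access);
 *   memcpy(dst, vertices, bytes_written);
 *   glFlushMappedBufferRange(GL_ARRAY_BUFFER, 0, bytes_written);
 *   glUnmapBuffer(GL_ARRAY_BUFFER);
 */
```

One detail that is easy to get wrong: the offset passed to glFlushMappedBufferRange is relative to the start of the mapped range, not the start of the buffer, which is why the usage comment flushes from 0.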
Of course, this technique isn't without peril - all synchronization is up to the client. The main danger is an over-run: your app is so fast that it needs to modify a range that the GL isn't done with - we made it all the way around our ring buffer. Probably the safest way to cope with this is to put explicit fences in place to wait until the last dependent draw call that we issued is finished.
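Here is a minimal model of that fencing scheme, with the GPU replaced by a pair of counters so it can run without a GL context. Everything in it (Ring, ring_acquire, the three-region split) is my own illustrative naming and sizing. In real code I would expect the stand-ins to become glFenceSync after the draw and glClientWaitSync before reuse (ARB_sync, core in GL 3.2), but treat that mapping as an assumption rather than the only way to fence.

```c
#include <assert.h>

#define RING_REGIONS 3 /* illustrative; any count >= 2 works */

typedef struct {
    unsigned long fence[RING_REGIONS]; /* serial of last draw reading each region, 0 = never drawn */
    unsigned long submitted;           /* draws queued so far */
    unsigned long completed;           /* draws the modeled GPU has retired */
    int next;                          /* region to hand out next */
    int waits;                         /* how often we actually had to block */
} Ring;

/* Stand-in for a client-side fence wait: block until the GPU passes 'serial'. */
void ring_wait(Ring *r, unsigned long serial)
{
    if (r->completed < serial) {
        r->completed = serial; /* model "blocking" by letting the GPU catch up */
        r->waits++;
    }
}

/* Claim the next region for an unsynchronized write. Blocks only when the
   last draw that read this region is still pending (the over-run case). */
int ring_acquire(Ring *r)
{
    int region = r->next;
    ring_wait(r, r->fence[region]);
    r->next = (region + 1) % RING_REGIONS;
    return region;
}

/* Queue a draw from 'region' and fence it (stand-in for inserting a fence). */
void ring_draw(Ring *r, int region)
{
    assert(region >= 0 && region < RING_REGIONS);
    r->fence[region] = ++r->submitted;
}

/* Stand-in for GPU progress: retire up to n pending draws. */
void ring_gpu_retire(Ring *r, unsigned long n)
{
    r->completed += n;
    if (r->completed > r->submitted)
        r->completed = r->submitted;
}
```

The point of the per-region fence is that we only pay for synchronization when we have really lapped the GPU. As long as the ring is big enough that the oldest region has drained by the time we come back to it, ring_acquire never waits.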