Wednesday, April 04, 2012

Beyond glMapBuffer

For a while X-Plane has had a performance problem pushing streaming vertices through ATI Radeon HD GPUs on Windows (but not OS X).  Our initial investigation showed that glMapBuffer was taking noticeable amounts of time, which is not the case on other platforms.  This post explains what we found and the work-around that we are using.

(Huge thanks to ATI for their help in giving me a number of clues!  What we did in hindsight seems obvious, but with 536 OpenGL extensions to choose from we might never have found the right solution.)

What Is Streaming?

To be clear on terminology, when I say streaming, I mean a vertex stream going to the GPU that more-or-less doesn't repeat, and is generated by the CPU per frame.  We have a few cases of these in X-Plane: rain drops on the windshield, car headlights (since the cars move per frame, we have to respecify the billboards every frame) and the cloud index buffers all fit into this category.  (In the case of the clouds, the Z sort is per frame, since it is affected by camera orientation; the puff locations are static.)

In all of these cases, we use an "orphan-and-map" strategy: for each frame we first do a glBufferData with a NULL pointer, which effectively invalidates the contents of the buffer at our current time stamp in the command stream; we then do a glMapBuffer with the write flag.  The result of this is to tell the driver that we want memory now and we don't care what's in it - we're going to respecify it anyway.

Most drivers will, in response to this, prepare a new buffer if the current one is still in-flight.  The effect is something like a FIFO, with the depth decided by the driver.  We like this because we don't actually know how deep the FIFO should be - it depends on how far behind the GPU is and how many GPUs are in our SLI/CrossFire cluster.
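
To make the "FIFO with driver-decided depth" idea concrete, here is a toy model of what an orphaning driver might do for a single buffer object.  This is not real driver code; OrphanPool, its members, and the frame-counter stand-in for GPU progress are all invented for illustration.  The application-side calls it corresponds to are shown in the comment.

```cpp
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Toy model (hypothetical, not driver code) of the storage FIFO that
// orphaning implies for ONE buffer object.
struct OrphanPool {
    struct Storage {
        std::vector<unsigned char> bytes;
        long last_used_frame;
    };
    std::deque<Storage> in_flight;   // oldest storage at the front
    std::size_t allocations = 0;     // how many fresh allocations were needed

    // Application side, this corresponds to:
    //   glBufferData(target, size, NULL, GL_STREAM_DRAW);   // orphan
    //   void* p = glMapBuffer(target, GL_WRITE_ONLY);       // map
    // gpu_done_frame is the newest frame the GPU has finished drawing.
    unsigned char* orphan_and_map(std::size_t size, long cpu_frame, long gpu_done_frame) {
        if (!in_flight.empty() &&
            in_flight.front().last_used_frame <= gpu_done_frame &&
            in_flight.front().bytes.size() >= size) {
            // The oldest storage is no longer referenced by the GPU: recycle it.
            Storage s = std::move(in_flight.front());
            in_flight.pop_front();
            s.last_used_frame = cpu_frame;
            in_flight.push_back(std::move(s));
        } else {
            // Everything is still in flight: allocate instead of blocking.
            in_flight.push_back(Storage{std::vector<unsigned char>(size), cpu_frame});
            ++allocations;
        }
        return in_flight.back().bytes.data();
    }
};
```

In this model the FIFO depth settles at however many frames the GPU lags behind the CPU, which is exactly the number the application cannot know in advance.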

Why Would Streaming Be Expensive?

If orphaning goes well, it is non-blocking for the CPU - we keep getting new fresh buffers from the driver and can thus get as far ahead of the GPU as the API will let us.  So it's easy to forget that what's going on under the hood has the potential to be rather expensive.*  Possible sources of expense:
  • If we really need a new buffer, that is a memory allocation, by definition not cheap (at least for some values of the word "cheap").
  • If the memory is new, it may need a VM operation to set its caching mode to uncached/write-combined.
  • If the driver doesn't want to just spit out new memory all of the time it has to check the command stream to see if the old memory is unlocked.
  • If the driver wants to be light on address space use, it may have unmapped the buffer, so now we have a VM operation to map the buffer into our address space.  (From what I can tell, Windows OpenGL drivers are often quite aggressive about staying out of the address space, which is great for 32-bit games.  By comparison, address space use with the same rendering load appears to always be higher on Mac and Linux.  64 bit, here we come!)
  • If orphaning is happening, at some point the driver has to go 'garbage collect' the orphaned buffers that are now floating around the system.
Stepping back, I think we can say this: using orphaning is asking the driver to implement a dynamic FIFO for you.  That's not cheap.

Real Performance Numbers

I went to look at the performance of our clouds with the following test conditions:
  • 100k particles, which means 400k vertices or 800 KB of index data (we use 16-bit indices).
  • 50 batches, so about 2000 particles per VBO/draw call.
  • The rest of the scenery system was set quite light to isolate clouds, and I shrunk the particle size to almost nothing to remove fill rate from the calculation.
(In this particular case, we need to break the particles into batches for both Z sorting and culling reasons, and the VBOs aren't shared in a segment-buffer-like scheme due to the scrolling of scenery.)
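
As a quick check of the sizes above, the arithmetic can be written out.  The constant and function names are made up for this sketch, and the figure of four indices per particle is an assumption: it is simply what the 800 KB total implies with 16-bit indices.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kParticles = 100000;
constexpr std::size_t kVertsPerParticle = 4;    // one billboard quad per puff
constexpr std::size_t kIndicesPerParticle = 4;  // assumption, implied by 800 KB / 2 bytes / 100k
constexpr std::size_t kBatches = 50;

constexpr std::size_t total_vertices() { return kParticles * kVertsPerParticle; }
constexpr std::size_t index_bytes() {
    return kParticles * kIndicesPerParticle * sizeof(std::uint16_t);
}
constexpr std::size_t particles_per_batch() { return kParticles / kBatches; }
constexpr std::size_t vertices_per_batch() { return particles_per_batch() * kVertsPerParticle; }

// 16-bit indices can only address 65536 vertices, so each batch must fit.
static_assert(vertices_per_batch() <= 65536, "batch too large for 16-bit indices");
```

Each batch therefore maps and rewrites only about 16 KB of index data, so the cost being measured is per-call overhead rather than bandwidth.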
Under these conditions, on my i5-2500 I saw these numbers.  The 0 ms really means < 1 ms, as my timer output is only good +/- 1 ms.  (Yes, that sucks...the error is in the UI of the timer, not the timer itself.)
  • NV GTX 580, 296.xx drivers: 2 ms to sort, 0 ms for map-and-write, 0 ms for draw.
  • ATI Radeon 7970 12-3 drivers: 2 ms to sort, 6 ms to map, 1 ms to write, 1 ms to draw.
That's a pretty huge difference in performance!  The map+write and draw are basically free on NV hardware, but cost 8 ms on ATI hardware.

glMapBufferRange Takes a Lock

In my original definition of streaming I described the traditional glBufferData(NULL) + glMapBuffer approach.  But glMapBufferRange provides more explicit semantics - you can pass the GL_MAP_INVALIDATE_BUFFER_BIT flag to request a discard-and-map in a single call, without the separate glBufferData(NULL).  What surprised me is that on ATI hardware this performed significantly worse!
It turns out that as of this writing, on ATI hardware you also have to pass GL_MAP_UNSYNCHRONIZED_BIT or the map call will block waiting on pending draw calls.  The more backed up your GPU is, the worse the wait; while the 6 ms of map time above is enough to care a lot, blocking on a buffer can cut your framerate in half.
I believe that this behavior is unnecessarily conservative - since I don't see buffer corruption with unsynchronized + invalidation, I have to assume that they are mapping new memory, and if that is the case, the driver should always return immediately to improve throughput.  But I am not a driver writer and there's probably fine print I don't understand.  There's certainly no harm in passing the unsynchronized bit.

With invalidate + unsynchronized, map-buffer-range has the same performance as bufferdata(NULL)+map-buffer.  Without the unsynchronized bit, map-buffer-range is really slow.
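
The two flag combinations being compared can be spelled out as follows.  The constants are redeclared here with the values from the standard OpenGL headers so that the sketch compiles without a GL context; discard_map_flags is a hypothetical helper, not part of any API.

```cpp
#include <cstdint>

// Values match the OpenGL headers (glMapBufferRange access bits).
constexpr std::uint32_t kMapWriteBit            = 0x0002;  // GL_MAP_WRITE_BIT
constexpr std::uint32_t kMapInvalidateBufferBit = 0x0008;  // GL_MAP_INVALIDATE_BUFFER_BIT
constexpr std::uint32_t kMapUnsynchronizedBit   = 0x0020;  // GL_MAP_UNSYNCHRONIZED_BIT

// Builds the access mask for a discard-and-map of the whole buffer.
constexpr std::uint32_t discard_map_flags(bool unsynchronized) {
    return kMapWriteBit | kMapInvalidateBufferBit |
           (unsynchronized ? kMapUnsynchronizedBit : 0u);
}

// With real GL headers the call would look like:
//   void* p = glMapBufferRange(GL_ARRAY_BUFFER, 0, size, discard_map_flags(true));
//   ... write vertices through p ...
//   glUnmapBuffer(GL_ARRAY_BUFFER);
```

Passing the unsynchronized bit moves responsibility for not overwriting in-flight data onto the application, which is safe here only because the whole buffer is invalidated in the same call.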

Map-Free Streaming with GL_AMD_pinned_memory

Since glMapBufferRange doesn't do any better when using it for orphaning, I tried a different path: GL_AMD_pinned_memory, an extension I didn't even know existed until recently.

The pinned memory extension converts a chunk of your memory into an AGP-style VBO - that is, a VBO that is physically resident in system memory, locked down, write-combined, and mapped through the GART so the GPU can see it.  (In some ways this is a lot like the old-school VAR extensions, except that the mapping is tied to a VBO so you can have several in flight at once.)  The short of it:
  • Your memory becomes the VBO storage.  Your memory can't be deallocated for the life of the VBO.
  • The driver locks down your memory and sets it to be uncached/write-combined.  I found that if my memory was not page-aligned, I got some really spectacular failures, including but not limited to crashing the display driver.  I am at peace with this - better to crash hard and not lose fps when the code is debugged!
  • Because your memory is the VBO, there is no need to map it - you know where its base address is and you can just write to it.
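
Since page alignment turned out to matter so much, here is a minimal sketch of allocating storage suitable for pinning.  It assumes a 4096-byte page (real code should query the OS) and uses C++17 std::aligned_alloc, which MSVC does not provide; on Windows something like _aligned_malloc or VirtualAlloc would be needed instead.  The helper names are hypothetical, and the GL calls in the comment reflect my reading of the extension spec rather than tested code.

```cpp
#include <cstddef>
#include <cstdlib>

constexpr std::size_t kPageSize = 4096;  // assumption: query the OS in real code

// Rounds a byte count up to a whole number of pages.
inline std::size_t round_up_to_page(std::size_t bytes) {
    return (bytes + kPageSize - 1) / kPageSize * kPageSize;
}

// Returns page-aligned storage covering whole pages; release with std::free,
// but only after the VBO that pins it has been deleted.
inline void* alloc_pinned_candidate(std::size_t bytes) {
    return std::aligned_alloc(kPageSize, round_up_to_page(bytes));
}

// Handing the memory to GL_AMD_pinned_memory would then look roughly like:
//   glBindBuffer(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD, vbo);
//   glBufferData(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD, bytes, ptr, usage);
//   glBindBuffer(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD, 0);
```

Rounding the size up to whole pages keeps unrelated heap data off any page the driver locks down and marks write-combined.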
This has a few implications compared to a traditional orphan & map approach:
  1. In return for saving map time, we are losing any address space management.  The pinned buffer will sit in physical memory and virtual address space forever.  So this is good for streaming amounts of data, but on a large scale might be unusable.  For "used a lot, mapped occasionally" this would be worse than mapping.  (But then, in that case, map performance is probably not important.)
  2. Because we never map, there is no synchronization.  We need to be sure that we are not rewriting the buffer for the next frame while the previous one is in flight.  These are the same semantics as using map-unsynchronized.
On this second point, my current implementation does the cheap and stupid thing: it allocates 4 segments of the pinned buffer, allowing us to queue up to 4 frames of data; theoretically we should be using a sync object to block in the case of "overrun".  The problem here is that we really never want to hit that sync - doing so indicates we don't have enough ring buffer.  But we don't want to allocate enough FIFO depth to cover the worst CrossFire or SLI case (where the outstanding number of frames can double or more) and eat that cost for all users.  Probably the best thing to do would be to fence and then add another buffer to the FIFO every time we fail the fence until we discover the real depth of the renderer.
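
The "fence, and grow the FIFO when the fence fails" idea can be modeled without any GL at all.  In this sketch SegmentRing is a hypothetical class, and a frame counter stands in for polling a real sync object (for example glClientWaitSync with a zero timeout); a real implementation would attach a fence to each segment and would probably make each segment its own pinned VBO.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

constexpr long kNeverUsed = std::numeric_limits<long>::min();

// Hypothetical model of a ring of pinned segments that discovers its own depth.
class SegmentRing {
public:
    explicit SegmentRing(std::size_t initial_depth)
        : last_frame_(initial_depth, kNeverUsed), next_(0) {}

    // Returns the index of the segment to fill for this frame.
    // gpu_done_frame is the newest frame the GPU has finished; it stands in
    // for "has the fence on the next segment signaled yet?".
    std::size_t acquire(long frame, long gpu_done_frame) {
        if (last_frame_[next_] > gpu_done_frame) {
            // Fence not signaled: the ring is too shallow.  Rather than block,
            // add a fresh segment in front of the one still in flight.
            last_frame_.insert(last_frame_.begin() + static_cast<std::ptrdiff_t>(next_),
                               kNeverUsed);
        }
        const std::size_t seg = next_;
        last_frame_[seg] = frame;
        next_ = (next_ + 1) % last_frame_.size();
        return seg;
    }

    std::size_t depth() const { return last_frame_.size(); }

private:
    std::vector<long> last_frame_;  // frame that last wrote each segment
    std::size_t next_;              // segment to try next (the oldest one)
};
```

Starting from 4 segments, the ring only grows on machines whose renderer really runs further ahead, so single-GPU users never pay for the CrossFire/SLI worst case.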
With pinned memory, mapping is free, the draw costs about 1 ms and the sort costs us 2 ms; a savings of about 6-7 ms!

Copying the Buffer

The other half of this is that we use glCopyBufferSubData to "blit" our pinned buffer into a static-draw, VRAM-based VBO for the actual draw.  Technically we never know where any VBO lives, but with pinned memory we can assume it's an AGP buffer; this means that we eat bus bandwidth every time we draw from it.

glCopyBufferSubData is an "in-band" copy, meaning that the copy happens when the GPU gets to it, not immediately upon making the API call.  This means that the call won't block just because a previous draw call that uses the destination buffer hasn't completed yet.

In practice, for our clouds, I saw better performance without the copy - that is, drawing vertices from AGP was quicker than copying to VRAM.  This isn't super-surprising, as the geometry gets used only twice, and it is used in a very fill-rate expensive way (drawing alpha particles).**  We lost about 0.5 ms by copying to VRAM.

Sanity Checks

Having improved cloud performance, I then went to look at our streaming light billboard and streaming spill volume code and found that this code was mistuned; the batch size was set low for some old day when we had fewer lights.  Now that our artists have had time to go nuts on the lighting engine, we were doing 5000 maps/second due to poor bucketing.

For that matter, the total amount of data being pushed in the stream was also really huge.  If there's a moral to this part of the story it is: sometimes the best way to make a slow API fast is to not use it.

Better Than Map-Discard

Last night I read this NVIDIA presentation from GDC 2012, and it surprised me a little; this whole exercise had been about avoiding map-discard on ATI hardware for performance reasons - on NVIDIA hardware the driver was plenty fast.  But one of the main ideas of the presentation is that you can do better than map-discard by creating your own larger ring buffer and using a sub-window.  For OpenGL I believe you'd use the unsynchronized, invalidate-range, and write flags and map in each next window as you fill it.

The win here is that the driver doesn't actually have to manage more than one buffer; it can do optimal things like just leave the buffer mapped for a while or return scratch memory and DMA it into the buffer later.  This idiom is still a map/unmap though, so if the driver doesn't have special optimization to make map fast, it wouldn't be a win.

(That is, on ATI hardware I suspect that ring-buffering your pinned VBO is better than using this technique.  But I must admit that I did not try implementing a ring buffer with map-discard-range.)

The big advantage of using an unsynchronized (well, synchronized only by the app) ring buffer is that you can allocate arbitrary size batches into it.  We haven't moved to that model in X-Plane because most of our streaming cases come in large and predictable quantities.
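
A minimal sketch of carving arbitrary-size batches out of one app-synchronized ring buffer might look like this.  RingAllocator is a hypothetical helper; it only hands out byte offsets, and making sure the GPU is done with a region before it is reused (via fences) is left to the caller.

```cpp
#include <cstddef>

// Hypothetical bump allocator over one large streaming buffer.
class RingAllocator {
public:
    explicit RingAllocator(std::size_t capacity) : capacity_(capacity), head_(0) {}

    // Returns the byte offset for a batch of `bytes` (must be <= capacity).
    // If the batch would run off the end, it wraps to the start instead of
    // splitting, so every batch is contiguous.
    std::size_t allocate(std::size_t bytes) {
        if (head_ + bytes > capacity_) {
            head_ = 0;
        }
        const std::size_t offset = head_;
        head_ += bytes;
        return offset;
    }

private:
    std::size_t capacity_;
    std::size_t head_;
};

// With real GL headers, each batch would then be mapped as a sub-window:
//   void* p = glMapBufferRange(GL_ARRAY_BUFFER, offset, bytes,
//                              GL_MAP_WRITE_BIT |
//                              GL_MAP_INVALIDATE_RANGE_BIT |
//                              GL_MAP_UNSYNCHRONIZED_BIT);
```

With a pinned VBO the same offsets could be applied directly to the base address, with no map call at all.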

* In all of these posts, I am not a driver writer and have no idea what the driver is really doing.  But on the one or two times I have seen real live production OpenGL driver code, I have been shocked by how much crud the driver has to get through to do what seems like a cheap API call.  It's almost like the GL API is a really high level of abstraction!  Anyway, the point of this speculation is to put into perspective why, when confronted with a slow API call, the right response might be to stop making the call, rather than to shout all over the internet that the driver writers are lazy.  They're not.

** When drawing from AGP memory, the rate at which the vertex shader advances through the draw call will be limited by the slower of how fast it can pull vertex data over the AGP bus and how fast it can push finished triangles into the setup engine.  It is reasonable to expect that for large numbers of fill-heavy particles, the vertex shader is going to block waiting for output space while the shading side is slow.  Is the bus idle at this point?  That depends on whether the driver is clever enough to schedule some other copy operation at the same time, but I suspect that often that bus bandwidth is ours to use.