The Hacks of Life: glMapBuffer No Longer Cool

Thursday, June 18, 2015


TL;DR: when streaming uniforms, glMapBuffer is not a great idea; glBufferSubData may actually work well in some cases.

I just fixed a nasty performance bug in X-Plane, and what I found goes directly against stuff I posted here, so I figured a new post might be in order.

A long time ago I more or less wrote this:

When you want to stream new data into a VBO, you need to either orphan it (e.g. get a new buffer) or use the new (at the time) unsynchronized mapping primitives and manage ranges of the buffer yourself.
If you don't do one of these two things, you'll block your thread waiting for the GPU to be done with the data that was being used before.
glBufferSubData can't do any better, and is probably going to do worse.
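As a rough sketch of the "manage ranges yourself" option, the C below tracks a write cursor inside one streaming VBO and reports when the buffer has to be orphaned. The `stream_ring` type and `stream_ring_reserve` function are hypothetical names, not part of OpenGL, and the GL calls that would accompany each step appear only in comments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: write cursor inside one streaming VBO. */
typedef struct {
    size_t capacity;  /* size the VBO was created with */
    size_t cursor;    /* first byte not yet handed out in the current storage */
} stream_ring;

/* Reserve `bytes` for new data. Returns the byte offset to write at and sets
   *orphan to 1 when the caller must get fresh storage before writing. */
size_t stream_ring_reserve(stream_ring *r, size_t bytes, int *orphan)
{
    assert(bytes <= r->capacity);
    *orphan = 0;
    if (r->cursor + bytes > r->capacity) {
        /* No untouched space left. The GPU may still be reading the old
           contents, so request new storage rather than overwrite them.
           In GL: glBufferData(target, capacity, NULL, usage); */
        *orphan = 1;
        r->cursor = 0;
    }
    size_t offset = r->cursor;
    r->cursor += bytes;
    /* In GL: glMapBufferRange(target, offset, bytes,
                  GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT); */
    return offset;
}
```

Because each reservation hands out bytes that no earlier draw call has touched, the unsynchronized map is safe without a fence. The only synchronization point is the orphan on wrap-around.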

Five years is a long time in GPU history, and those rules don't quite apply anymore.

Everything about not blocking on the GPU with map buffer is still true - if you do a synchronized map buffer, you're going to block hard. Never do that!

But...these days on Windows, the OpenGL driver is running in a separate thread from your app. When you issue commands, it just marshals them into a FIFO as fast as it can and returns. The idea is to keep the app rendering time and driver command buffer assembly from being sequential.

The first problem is: glMapBuffer has to return an actual buffer pointer to you! Since your thread isn't actually doing real work, this means one of two things:

Blocking the app thread until the driver actually services the requests, then returning the result. This is bad. I saw some slides a while back where NVidia said that this is what happens in real life.
In theory under just the right magic conditions glMapBuffer could return scratch memory for use later. It's possible under the API if a bunch of stuff goes well, but I wouldn't count on it. For streaming to AGP memory, where the whole point was to get the real VBO, this would be fail.

It should also be noted at this point that, at high frequency, glMapBuffer isn't that fast. We still push some data into the driver via client arrays (I know, right?) because when measuring unsynchronized glMapBufferRange vs just using client arrays and letting the driver memcpy, the latter was never slower and in some cases much faster.*

Can glBufferSubData Do Better?

Here's what surprised me: in at least one case, glBufferSubData is actually pretty fast. How is this possible?

A naive implementation of glBufferSubData might look like this:
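void glBufferSubData(GLenum target, GLintptr offset, GLsizeiptr size, const GLvoid * data)
{
    /* Synchronized map: blocks until the GPU is done with the buffer. */
    GLvoid * ptr = glMapBuffer(target, GL_WRITE_ONLY);
    /* Write at the requested offset, not at the start of the buffer. */
    memcpy((char *) ptr + offset, data, size);
    glUnmapBuffer(target);
}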
The synchronized map buffer up top is what gets you a stall on the GPU, the thing I called "really really bad" five years ago.

But what if we want to be a little bit more aggressive?
void glBufferSubData(GLenum target, GLintptr offset, GLsizeiptr size, const GLvoid * data)
{
    /* Full replacement: orphan first so the map below cannot block on the GPU. */
    if(offset == 0 && size == size_of_currently_bound_vbo)
        glBufferData(target, size, NULL, last_buffer_usage);
    GLvoid * ptr = glMapBuffer(target, GL_WRITE_ONLY);
    memcpy((char *) ptr + offset, data, size);
    glUnmapBuffer(target);
}
In this case, we have, in the special case of completely replacing the VBO, removed the block on the GPU. We know it's safe to simply orphan and splat.

What's interesting about this code is that the API to glBufferSubData is one-way - nothing is returned, so the code above can run in the driver thread, and the inputs to glBufferSubData can easily be marshaled for later use. By keeping the results of glMapBuffer private, we can avoid a stall.

(We have eaten a second memcpy - one to marshal and one to actually blit into the real buffer. So this isn't great for huge amounts of data.)
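To make the two copies concrete, here is a toy, single-threaded model of a marshaled call. It is an assumption about the general shape of a threaded driver, not any vendor's actual code, and every name in it (`cmd_fifo`, `fifo_push_sub_data`, `fifo_drain`) is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { FIFO_SLOTS = 8, PAYLOAD_MAX = 64 };

/* One marshaled "buffer sub data" call (hypothetical layout). */
typedef struct {
    unsigned char *dst;                  /* stand-in for the real buffer storage */
    size_t offset, size;
    unsigned char payload[PAYLOAD_MAX];  /* private copy of the caller's data */
} sub_data_cmd;

typedef struct {
    sub_data_cmd slots[FIFO_SLOTS];
    size_t head, tail;                   /* head: next to run, tail: next to fill */
} cmd_fifo;

/* App-thread side: copy the inputs and return at once (memcpy #1).
   Returns 0 if the FIFO is full or the payload does not fit in a slot. */
int fifo_push_sub_data(cmd_fifo *f, unsigned char *dst, size_t offset,
                       size_t size, const void *data)
{
    if (f->tail - f->head == (size_t)FIFO_SLOTS || size > (size_t)PAYLOAD_MAX)
        return 0;
    sub_data_cmd *c = &f->slots[f->tail % FIFO_SLOTS];
    c->dst = dst;
    c->offset = offset;
    c->size = size;
    memcpy(c->payload, data, size);
    f->tail++;
    return 1;
}

/* Driver-thread side: run every queued command (memcpy #2).
   Returns the number of commands executed. */
size_t fifo_drain(cmd_fifo *f)
{
    size_t n = 0;
    while (f->head != f->tail) {
        sub_data_cmd *c = &f->slots[f->head % FIFO_SLOTS];
        memcpy(c->dst + c->offset, c->payload, c->size);
        f->head++;
        n++;
    }
    return n;
}
```

Nothing flows back to the caller except "queued or not", which is why the push never has to wait for the drain. A map call cannot be modeled this way, because the caller needs the pointer before it can continue.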

Anyway, from what I can tell, the latest shipping drivers from NVidia, AMD and Intel all do this - there is no penalty for doing a full glBufferSubData, and in the case of NVidia, it goes significantly faster than orphan+map.

A glBufferSubData update like this is sometimes referred to as "in-band" - it can happen either by the driver queuing a DMA to get the data into place just in time (in-band in the command stream) or by simply renaming the resource (that is, using separate memory for each version of it).
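Renaming can be pictured with a small model in plain C. This is only a sketch of the idea, with `malloc` standing in for whatever memory pool a real driver would use. `renamed_buffer` and `renamed_buffer_replace` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model: the buffer object keeps its identity while the
   storage behind it is swapped on every full update. */
typedef struct {
    unsigned char *storage;  /* memory the next draw call will read */
    size_t size;
} renamed_buffer;

/* Full replace: write into a brand new version instead of waiting for
   readers of the old one. On success returns 1 and stores the previous
   storage (NULL on first use) in *retired; the caller frees it once the
   GPU work that references it has finished. Returns 0 if allocation fails,
   leaving the buffer unchanged. */
int renamed_buffer_replace(renamed_buffer *b, const void *data, size_t size,
                           unsigned char **retired)
{
    assert(size > 0);
    unsigned char *fresh = malloc(size);
    if (fresh == NULL)
        return 0;
    memcpy(fresh, data, size);
    *retired = b->storage;  /* in-flight draws keep reading this version */
    b->storage = fresh;
    b->size = size;
    return 1;
}
```

The old version stays valid for draws that were already queued, so the update never has to wait on the GPU.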

Using glBufferSubData on Uniforms

The test case I was looking at was with uniform buffer objects. Streaming uniforms are a brutal case:

A very small amount of data is going to get updated nearly every draw call - the speed at which we update our uniforms basically determines our draw call rate, once we avoid knuckle-headed stuff like changing shaders a lot.
Loose uniforms perform quite well on Windows - but it's still a lot of API traffic to update uniforms a few bytes at a time.
glMapBuffer is almost certainly too expensive for this case.

We have a few options to try to get faster uniform updates:

glBufferSubData does appear to be viable. In very very limited test cases it looks the same or slightly faster than loose uniforms for small numbers of uniforms. I don't have a really industrial test case yet. (This is streaming - we'd expect a real win when we can identify static uniforms and not stream them at all.)
If we can afford to pre-build our UBO to cover multiple draw calls, this is potentially a big win, because we don't have to worry about small-batch updates. But this also implies a second pass in app-land or queuing OpenGL work.**
Another option is to stash the data in attributes instead of uniforms. Is this any better than loose uniforms? It depends on the driver. On OS X attributes beat loose uniforms by about 2x.
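For the pre-built UBO option, the bookkeeping is mostly about alignment: each draw call's slice has to start at a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. A minimal sketch of the packing step follows, assuming every draw uses the same block size. `align_up` and `pack_draw_uniforms` are hypothetical helpers, and the GL calls are shown only in comments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Round n up to the next multiple of align (any positive value). */
size_t align_up(size_t n, size_t align)
{
    assert(align > 0);
    return (n + align - 1) / align * align;
}

/* Pack `count` per-draw uniform blocks of `block_size` bytes each into
   `staging`, one aligned slice per draw. Returns the total bytes used and
   stores the slice stride in *stride_out, or returns 0 if staging is too
   small. Padding bytes between slices are left untouched. */
size_t pack_draw_uniforms(unsigned char *staging, size_t staging_size,
                          const void *blocks, size_t block_size,
                          size_t count, size_t align, size_t *stride_out)
{
    size_t stride = align_up(block_size, align);
    size_t total = stride * count;
    if (total > staging_size)
        return 0;
    for (size_t i = 0; i < count; i++)
        memcpy(staging + i * stride,
               (const unsigned char *)blocks + i * block_size, block_size);
    *stride_out = stride;
    /* In GL: one upload for the whole batch,
         glBufferSubData(GL_UNIFORM_BUFFER, 0, total, staging);
       then per draw call i,
         glBindBufferRange(GL_UNIFORM_BUFFER, binding, ubo,
                           i * stride, block_size); */
    return total;
}
```

With a typical alignment of 256 bytes, a 16-byte block per draw wastes most of each slice, so this trades memory for fewer buffer updates.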

Toward this last point, my understanding is that some drivers need to allocate registers in your shaders for all attributes, so moving high-frequency uniforms to attributes increases register pressure. This makes it a poor fit for low-frequency uniforms. We use attributes-as-uniforms in X-Plane for a very small number of parameters where it's useful to be able to change them at a frequency close to the draw call count.

I'm working on a comprehensive test engine now to assess performance on every driver stack I have access to. When I have complete data, I'll write up a post.

* The one case that is pathological is the AMD Catalyst 13-9 drivers - the last ones that support pre-DX11 cards. On those cards, there is no caching of buffer mappings, so using map buffer at high frequency is unshippable. The current AMD glMapBuffer implementation for DX11 cards appears to have similar overhead to NVidia's.

** This is a case we can avoid in the next-gen APIs; since command buffers are explicitly enqueued, we can leave our UBO open and stream data into it as we write the command buffer, and know that we won't get flushed early. OpenGL's implicit flush makes this impossible.

17 comments:

Rick, 6/23/2015 2:23 AM
What I'm missing in the post is any reference to GL_MAP_PERSISTENT_BIT. Have you tried that?
Benjamin Supnik, 6/23/2015 9:31 AM
I haven't tried that yet - it's on my todo list for the next refactor of the code - trying persistent mapping, map buffer range, pre-filling, and in-band buffer subdata. I am hopeful that map-persistent will be a top performer while having good flexibility. I'll post the results when I get them.
pixeljetstream, 7/07/2015 4:58 PM
indeed glMapBuffer is not so cool... unless persistent.

had pretty good results with persistent for vertex data streaming. N buffers recycled every N frames. Make sure to omit the MAP_READ and COHERENT bits, as they may trigger slower type of memory on NVIDIA.

subdata is quite optimized for UBOs see http://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117-OpenGL-Scene-Rendering-Techniques.pdf

there is also some core principles of next-gen apis already available in GL http://on-demand.gputechconf.com/gtc/2015/presentation/S5135-Christoph-Kubisch-Pierre-Boudier.pdf
RobbieS, 10/20/2015 3:06 PM
Back when I was an OGL driver engineer, we'd have debates between using MapBuffer[Range] and Buffer[Sub]Data. For our implementation, we definitely preferred Buffer[Sub]Data, because of our multi-threaded driver. However, our DevTech wanted us to promote MapBuffer[Range], because it followed along with the D3D way of doing things for porting reasons (as D3D promoted using Map). But D3D had the advantage of the D3D10_MAP_FLAG_DO_NOT_WAIT bit, which is functionality that OGL did not expose.

On top of that, we had the fun of developers conflating D3D10_MAP_FLAG_DO_NOT_WAIT with GL_MAP_UNSYNCHRONIZED_BIT, which was...awesome.
Benjamin Supnik, 10/23/2015 4:13 PM
I've read the Mantle programming guide maybe twice now, and what they have is similar: they expose the idea that your textures are going to have to go through the command processor (maybe on a DMA queue for new hw that has that) so that the hardware can do tile twizzling on the image format. You can never directly map the image since it's not linear, etc. etc.

The mantle stuff is vague on constant buffers too - it appears that there is:

1. A slow path when you have to update your descriptor table - the rules for mapping and unmapping descriptor tables are strict so they're only going to be fast if you can write out one giant descriptor table and use it for a big chunk of the frame.

2. A fast path for changing -one- memory descriptor - you apparently get to edit one memory descriptor on the fly - this could be a fast path for changing the window for constants that are streaming per-draw.

For DX12, you can update the root signature quickly per draw call, in theory...I've read a bunch of IHV recommendations and they're totally all over the map. :-)

I looked at the GCN docs and it looks like you can get 16 D-words loaded directly into a vertex shader from the command processor (the SH user regs) and vary their contents per-draw call cheaply. So I'm guessing that:
- Mantle uses two of them as a base pointer to the dynamic descriptor and
- DX12 maps the first 16 d-words of the root signature there.

Anyway, my assumption is that we'll definitely be able to:
1. Write high-frequency-update uniforms directly to a persistently mapped buffer, and
2. cheaply move the base pointer per draw call.

My guess is that it will not be practical to push per-draw call info "in-band" due to pressure on the very small amount of data the hardware can manage per draw call.
- 16 d-words doesn't go very far as attributes on AMD hw.
- It looks like the Intel hw has to window the whole root signature and therefore making it bigger gets expensive.
- I don't know what the green team has under the hood.
Unknown, 10/26/2015 5:11 AM
Speaking of register pressure, will instanced attributes act any differently? I tried placing 24 floats in 6 instanced attributes, and it was faster than fetching from an array within a uniform buffer by gl_InstanceID.
Daniel, 3/01/2017 11:54 AM
(I hope you don't mind me commenting on this old Blogpost)

The pseudocode for the more aggressive version of glBufferSubData() made me wonder if there is a difference (semantic, or practical in common implementations) between glBufferSubData(target, 0, fullsize, data) and glBufferData(target, fullsize, data, last_buffer_usage)?

Related: Does doing glBufferData() with data=NULL and then glMapBuffer() + memcpy() (like you did) have any advantage over just calling glBufferData() with the actual data directly? (I'd imagine the libGL doing the same thing internally in both cases, but I'm quite new to OpenGL and graphics programming in general so my mental model of the whole thing is probably pretty flawed)