TL;DR: when streaming uniforms, glMapBuffer is not a great idea; glBufferSubData may actually work well in some cases.
I just fixed a nasty performance bug in X-Plane, and what I found goes directly against stuff I posted here, so I figured a new post might be in order.
A long time ago I more or less wrote this:
- When you want to stream new data into a VBO, you need to either orphan it (e.g. get a new buffer) or use the new (at the time) unsynchronized mapping primitives and manage ranges of the buffer yourself. (A sketch of both techniques follows below.)
- If you don't do one of these two things, you'll block your thread waiting for the GPU to be done with the data that was being used before.
- glBufferSubData can't do any better, and is probably going to do worse.
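For reference, here's a minimal sketch of those two streaming techniques - the names vbo, vbo_size, new_data, and the write cursor are made up for illustration:

/* Option 1: orphan. glBufferData with NULL asks the driver for fresh
   storage; the GPU keeps reading the old copy until it is done with it. */
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, vbo_size, NULL, GL_STREAM_DRAW);
GLvoid * ptr = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
memcpy(ptr, new_data, vbo_size);
glUnmapBuffer(GL_ARRAY_BUFFER);

/* Option 2: unsynchronized map. We promise never to touch a range the
   GPU is still reading, and manage sub-ranges (and fences) ourselves. */
GLvoid * dst = glMapBufferRange(GL_ARRAY_BUFFER, cursor, write_size,
                                GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(dst, new_data, write_size);
glUnmapBuffer(GL_ARRAY_BUFFER);
cursor += write_size;  /* wrap and fence when the buffer fills up */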
Everything about not blocking on the GPU with map buffer is still true - if you do a synchronized map buffer, you're going to block hard. Never do that!
But...these days on Windows, the OpenGL driver is running in a separate thread from your app. When you issue commands, it just marshals them into a FIFO as fast as it can and returns. The idea is to keep the app rendering time and driver command buffer assembly from being sequential.
The first problem is: glMapBuffer has to return an actual buffer pointer to you! Since your app thread isn't the one doing the real work, this means one of two things:
- Blocking the app thread until the driver actually services the requests, then returning the result. This is bad. I saw some slides a while back where NVidia said that this is what happens in real life.
- In theory under just the right magic conditions glMapBuffer could return scratch memory for use later. It's possible under the API if a bunch of stuff goes well, but I wouldn't count on it. For streaming to AGP memory, where the whole point was to get the real VBO, this would be fail.
Can glBufferSubData Do Better?
Here's what surprised me: in at least one case, glBufferSubData is actually pretty fast. How is this possible?

A naive implementation of glBufferSubData might look like this:
void glBufferSubData(GLenum target, GLintptr offset, GLsizeiptr size, const GLvoid * data)
{
	/* Synchronized map: blocks until the GPU is done with the buffer. */
	GLvoid * ptr = glMapBuffer(target, GL_WRITE_ONLY);
	memcpy((char *) ptr + offset, data, size);
	glUnmapBuffer(target);
}
The synchronized map buffer up top is what gets you a stall on the GPU - the very thing I was calling "really really bad" five years ago.
But what if we want to be a little bit more aggressive?
void glBufferSubData(GLenum target, GLintptr offset, GLsizeiptr size, const GLvoid * data)
{
	/* If the whole buffer is being replaced, orphan it first - the GPU
	   keeps the old storage, so the map below won't block. */
	if(offset == 0 && size == size_of_currently_bound_vbo)
		glBufferData(target, size, NULL, last_buffer_usage);
	GLvoid * ptr = glMapBuffer(target, GL_WRITE_ONLY);
	memcpy((char *) ptr + offset, data, size);
	glUnmapBuffer(target);
}
In the special case of completely replacing the VBO, we have now removed the block on the GPU. We know it's safe to simply orphan and splat.
What's interesting about this code is that the API to glBufferSubData is one-way - nothing is returned, so the code above can run in the driver thread, and the inputs to glBufferSubData can easily be marshaled for later use. By keeping the results of glMapBuffer private, we can avoid a stall.
(We have eaten a second memcpy - one to marshal and one to actually blit into the real buffer. So this isn't great for huge amounts of data.)
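To make that concrete, the app-side stub might conceptually look like this - purely illustrative pseudocode, not any vendor's actual implementation, and the cmd_fifo_* helpers are invented names:

void app_side_glBufferSubData(GLenum target, GLintptr offset, GLsizeiptr size, const GLvoid * data)
{
	/* memcpy #1: copy the arguments and the payload into the FIFO. */
	cmd_fifo_push_cmd(CMD_BUFFER_SUB_DATA, target, offset, size);
	cmd_fifo_push_bytes(data, size);
	/* Return immediately - the driver thread pops the command later
	   and does memcpy #2 into the real (possibly renamed) buffer. */
}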
Anyway, from what I can tell, the latest shipping drivers from NVidia, AMD and Intel all do this - there is no penalty for doing a full glBufferSubData, and in the case of NVidia, it goes significantly faster than orphan+map.
A glBufferSubData update like this is sometimes referred to as "in-band" - it can happen either by the driver queuing a DMA to get the data into place just in time (in-band in the command stream) or by simply renaming the resource (that is, using separate memory for each version of it).
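So on current drivers the app-side streaming path can be as simple as one call - a sketch, where vbo, vbo_size, and frame_data are placeholder names:

/* Full replacement: offset 0, whole size. Modern drivers handle this
   in-band, so nothing blocks and no pointer crosses the thread boundary. */
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferSubData(GL_ARRAY_BUFFER, 0, vbo_size, frame_data);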
Using glBufferSubData on Uniforms
The test case I was looking at was with uniform buffer objects. Streaming uniforms are a brutal case:
- A very small amount of data is going to get updated nearly every draw call - the speed at which we update our uniforms basically determines our draw call rate, once we avoid knuckle-headed stuff like changing shaders a lot.
- Loose uniforms perform quite well on Windows - but it's still a lot of API traffic to update uniforms a few bytes at a time.
- glMapBuffer is almost certainly too expensive for this case.*
- glBufferSubData does appear to be viable. In very very limited test cases it looks the same or slightly faster than loose uniforms for small numbers of uniforms. I don't have a really industrial test case yet. (This is streaming - we'd expect a real win when we can identify static uniforms and not stream them at all.)
- If we can afford to pre-build our UBO to cover multiple draw calls, this is potentially a big win, because we don't have to worry about small-batch updates (see the sketch after this list). But this also implies a second pass in app-land or queuing OpenGL work.**
- Another option is to stash the data in attributes instead of uniforms. Is this any better than loose uniforms? It depends on the driver. On OS X attributes beat loose uniforms by about 2x.
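Here's roughly what the pre-built UBO approach looks like - a sketch only; ubo, draw_count, all_uniforms, and uniforms_t are made-up names, though the alignment query is real:

/* One big UBO covering many draw calls; sub-range offsets must honor
   GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. */
GLint align = 0;
glGetIntegerv(GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, &align);
GLsizeiptr per_draw = ((sizeof(uniforms_t) + align - 1) / align) * align;

glBindBuffer(GL_UNIFORM_BUFFER, ubo);
/* One streaming upload for the whole batch - all_uniforms is laid out
   with per_draw stride in app memory. */
glBufferSubData(GL_UNIFORM_BUFFER, 0, per_draw * draw_count, all_uniforms);

for(int i = 0; i < draw_count; ++i)
{
	/* Point binding 0 at this draw's slice of the UBO, then draw. */
	glBindBufferRange(GL_UNIFORM_BUFFER, 0, ubo, i * per_draw, sizeof(uniforms_t));
	/* ...issue draw call i... */
}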
I'm working on a comprehensive test engine now to assess performance on every driver stack I have access to. When I have complete data, I'll write up a post.
* The one case that is pathological is the AMD Catalyst 13-9 drivers - the last ones that support pre-DX11 cards. On those drivers there is no caching of buffer mappings, so using map buffer at high frequency is unshippable. The current AMD glMapBuffer implementation for DX11 cards appears to have similar overhead to NVidia's.
** This is a case we can avoid in the next-gen APIs; since command buffers are explicitly enqueued, we can leave our UBO open and stream data into it as we write the command buffer, and know that we won't get flushed early. OpenGL's implicit flush makes this impossible.