Thursday, June 18, 2015

glMapBuffer No Longer Cool

TL;DR: when streaming uniforms, glMapBuffer is not a great idea; glBufferSubData may actually work well in some cases.

I just fixed a nasty performance bug in X-Plane, and what I found goes directly against stuff I posted here, so I figured a new post might be in order.

A long time ago I more or less wrote this:

  • When you want to stream new data into a VBO, you need to either orphan it (e.g. get a new buffer) or use the new (at the time) unsynchronized mapping primitives and manage ranges of the buffer yourself.
  • If you don't do one of these two things, you'll block your thread waiting for the GPU to be done with the data that was being used before.
  • glBufferSubData can't do any better, and is probably going to do worse.
Five years is a long time in GPU history, and those rules don't quite apply.

Everything about not blocking on the GPU with map buffer is still true - if you do a synchronized map buffer, you're going to block hard.  Never do that!

But...these days on Windows, the OpenGL driver is running in a separate thread from your app. When you issue commands, it just marshals them into a FIFO as fast as it can and returns. The idea is to keep the app rendering time and driver command buffer assembly from being sequential.

The first problem is: glMapBuffer has to return an actual buffer pointer to you! Since your thread isn't actually doing real work, this means one of two things:

  1. Blocking the app thread until the driver actually services the requests, then returning the result. This is bad. I saw some slides a while back where NVidia said that this is what happens in real life.
  2. In theory under just the right magic conditions glMapBuffer could return scratch memory for use later. It's possible under the API if a bunch of stuff goes well, but I wouldn't count on it. For streaming to AGP memory, where the whole point was to get the real VBO, this would be fail.
It should also be noted at this point that, at high frequency, glMapBuffer isn't that fast. We still push some data into the driver via client arrays (I know, right?) because when measuring unsynchronized glMapBufferRange vs just using client arrays and letting the driver memcpy, the later was never slower and in some cases much faster.*

Can glBufferSubData Do Better?

Here's what surprised me: in at least one case, glBufferSubData is actually pretty fast. How is this possible?

A naive implementation of glBufferSubData might look like this:
void glBufferSubData(GLenum target, GLintptr offset, GLsizeiptr size, const GLvoid * data)
GLvoid * ptr = glMapBuffer(target,GL_WRITE_ONLY);
memcpy(ptr, data, size);
The synchronized map buffer up top is what gets you a stall on the GPU, the thing I was suggesting is "really really bad" five years ago.

But what if we want to be a little bit more aggressive?
void glBufferSubData(GLenum target, GLintptr offset, GLsizeiptr size, const GLvoid * data)
if(offset == 0 && size == size_of_currently_bound_vbo)
GLvoid * ptr = glMapBuffer(target,GL_WRITE_ONLY);
memcpy(ptr, data, size);
In this case, we have, in the special case of completely replacing the VBO, removed the block on the GPU. We know it's safe to simply orphan and splat.

What's interesting about this code is that the API to glBufferSubData is one-way - nothing is returned, so the code above can run in the driver thread, and the inputs to glBufferSubData can easily be marshaled for later use.  By keeping the results of glMapBuffer private, we can avoid a stall.

(We have eaten a second memcpy - one to marshall and one to actually blit into the real buffer. So this isn't great for huge amounts of data.)

Anyway, from what I can tell, the latest shipping drivers from NVidia, AMD and Intel all do this - there is no penalty for doing a full glBufferSubData, and in the case of NVidia, it goes significantly faster than orphan+map.

A glBufferSubData update like this is sometimes referred to as "in-band" - it can happen either by the driver queuing a DMA to get the data into place just in time (in-band in the commands stream) or by simply renaming the resource (that is, using separate memory for each version of it).

Using glBufferSubData on Uniforms

The test case I was looking at was with uniform buffer objects.  Streaming uniforms are a brutal case:

  • A very small amount of data is going to get updated nearly every draw call - the speed at which we update our uniforms basically determines our draw call rate, once we avoid knuckle-headed stuff like changing shaders a lot.
  • Loose uniforms perform quite well on Windows - but it's still a lot of API traffic to update uniforms a few bytes at a time.
  • glMapBuffer is almost certainly too expensive for this case.
We have a few options to try to get faster uniform updates:

  1. glBufferSubData does appear to be viable. In very very limited test cases it looks the same or slightly faster than loose uniforms for small numbers of uniforms. I don't have a really industrial test case yet. (This is streaming - we'd expect a real win when we can identify static uniforms and not stream them at all.)
  2. If we can afford to pre-build our UBO to cover multiple draw calls, this is potentially a big win, because we don't have to worry about small-batch updates. But this also implies a second pass in app-land or queuing OpenGL work.**
  3. Another option is to stash the data in attributes instead of uniforms. Is this any better than loose uniforms? It depends on the driver.  On OS X attributes beat loose uniforms by about 2x.
Toward this last point, my understanding is that some drivers need to allocate registers in your shaders for all attributes, so moving high-frequency uniforms to attributes increases register pressure. This makes it a poor fit for low-frequency uniforms. We use attributes-as-uniforms in X-Plane for a very small number of parameters where it's useful to be able to change them at a frequency close to the draw call count.

I'm working on a comprehensive test engine now to assess performance on every driver stack I have access to. When I have complete data, I'll write up a post.

* The one case that is pathological is the AMD Catalyst 13-9 drivers - the last ones that support pre-DX11 cards. In those cards, there is no caching of buffer mappings, so using map buffer at high frequency is unshipable.  The current AMD glMapBuffer implementation for DX11 cards appears to have similar overhead to NVidia's.

* This is a case we can avoid in the next-gen APIs; since command buffers are explicitly enqueued, we can leave our UBO open and stream data into it as we write the command buffer, and know that we won't get flushed early.  OpenGL's implicit flush makes this impossible.

Wednesday, June 10, 2015

OS X Metal - Raw Notes

Just a few raw notes about Metal as I go through the 2015 WWDC talks. This mostly discusses what has been added to Metal to make it desktop-ready.

Metal for desktop has instancing, sane constant buffers, texture barrier, occlusion query, and draw-indirect. It looks like it does not have transform feedback, geometry shaders or tessellation. (The docs do mention outputting vertices to a buffer with a nil fragment function, but I don't see a way to specify the output buffer for vertex transform. I also don't see any function attach points for geometry shaders or tessellation shaders.)

The memory model is "common use cases" - that is, they give you a few choices:
  • Shared - one copy of the data in AGP memory (streaming/dynamic in OpenGL). Coherency is apparently at the command buffer, not continuous, so this apparently does not rely on map-coherent (and I would assume has lower overhead).
  • Managed - what we have now for static geometry: a CPU side and GPU side copy that are synced. Sync is explicit by flushing CPU side changes (via didModifyRange).  Reading back changes from the GPU is explicit -and- queued via a synchronizeResource call on the blit command encoder. For shared memory devices (e.g. Intel?) there's only one copy (e.g. this backs off to "shared").
  • Private - the memory is entirely on the GPU side (possibly in VRAM) - the win here is that the format can be tiled/swizzled/whatever is fastest for the GPU. Therefore this is the right storage option for framebuffers.  Access to/from the data is only from blit command encoder operations.
  • Auto - a meta-format for textures - turns into shared on IOS (which has no managed) and managed on desktop - so that cross-platform code can do one thing everywhere. (This seems odd to me, because desktop apps will want to have some meshes be managed too.)
The caching model (e.g. write-combined) is a separate flag on buffer objects. Mapping is always available and persistent.

There appears to be no access to the parallel command queues that modern GCN2 devices have. Modern GPUs can run blit/DMA operations in parallel with rendering operations; this uses the hardware more efficiently but also introduces an astonishing amount of complexity into APIs like Mantle that tell developers "here's two async queues - good luck staying coherent."

Perhaps Metal internally offloads blit command buffers to the DMA queue and inserts a wait in the rendering encoder for the bad-luck case where the blit doesn't finish enough ahead of time.

Unlike Mantle, there is no requirement to manually manage the reference pool for the resources that a command queue has access to - this also simplifies things.

Finally, Mantle has a more complex memory model; in Mantle, you get big pools of memory from the driver and then jam resources into them yourself, letting games create pool allocators as desired. In Metal, everything's just a resource; you don't really know how much VRAM you have access to or how stuffed it is, and paging the managed pool is entirely within the driver.  (One exception: you can create a buffer directly off of a VM page with no copy, but this still isn't the same as what Mantle gives you.)

As the guy who would have to code to these APIs, I definitely like Apple's simpler, more automatic model a lot more than the state change rules for Mantle; naively (having not coded it) it seems like the app logic to handle state change would be either very complex and finicky or non-optimal. But I'd have to know what the cost in performance is of letting the driver keep these tasks. Apple's showing huge performance wins over OpenGL, but that doesn't validate the idea of driver-managed resource coherency; you'd have to compare to Mantle to see who should own the task.

I can imagine AAA game developers being annoyed that there isn't a pooling abstraction like in Mantle, since the "pool of memory you subdivide" model works well with what console games do on their own to keep memory use under control.

Overall, based on what I've read of the API, Metal looks a solid, well-thought-out next generation API; similar to Mantle in how it reorganizes work-flow, but less complex. It's still missing some modern desktop GPU functionality, but moving Metal to discrete hardware with discrete memory hasn't turned the API into a swamp.

Finally, from an adoption stand-point: it looks to me that Metal on OS X gives Apple a way to try to leverage its strong position in mobile gaming to move titles to the desktop, which is a more viable sell than trying to use a more modern OpenGL to move titles from PC. (Only having a DirectX clone would help with that.)

Friday, May 22, 2015

Underestanding PowerVR GPUs via Metal

In my previous post I suggested that OpenGL and OpenGL ES, as APIs, don't always fit the underlying hardware. One way to understand this is to read GPU hardware documentation - AMD is pretty good about posting hardware specs, e.g. ISAs, register listings, etc. You can also read the extensions and see the IHV trying to bend the API to be closer to the hardware (see NVidia's big pile of bindless this and bindless that). But both these ways of "studying" the hardware are time consuming and not practical if you don't do 3-d graphics full time.

Recently there has been a flood of new low-level, close-to-the-hardware APIs: metal (Apple, PowerVR), Mantle (AMD, GCN), Vulkan (Khronos, everything), DirectX 12 (Microsoft, desktop GPUs). This provides us another way of understanding the hardware: we can look at what the graphics API would look like if it were rewritten to match today's hardware.

Let's take a look at some Metal APIs and see what they tell us about the PowerVR hardware.

Mutability Is Expensive

A texture in Metal is referenced via an MTLTexture object.* Note that while it has properties to get its dimensions, there is no API to change its size! Instead you have to fill in a new MTLTextureDescriptor and use that to make a brand new MTLTexture object.

In graphics terms, the texture is immutable. You can change the contents of its image, but you can't change the object itself in such a way that the underlying hardware resources and shader instructions associated with the texture have to be altered.

This is a win for the driver: when you go to use an MTLTexture, whatever was true about the texture last time you used it is still true now, always.

Compare this to OpenGL. With OpenGL, you can bind the texture id to a new texture - not only with different dimensions, but maybe of a totally different type. Surprise, OpenGL - that 2-d texture I used is now a cube map! Because anything can change at any time, OpenGL has to track mutations and re-check the validity of bound state when you draw.

Commands Are Assembled in Command Buffers and Then Queued for the GPU

How do your OpenGL commands actually get to the GPU? The OpenGL way involves a fair amount of witchcraft:
  1. You make an OpenGL context current to a thread.
  2. You issue function calls into the OpenGL API.
  3. "Later" stuff happens. If you never call glFlush, glFinish, or some kind of swap command, maybe some of your commands never execute.
That's definitely not how the hardware works. Again, Metal gives us a view of the underlying implementation.

On every modern GPU where I've been able to find out how command processing works, the GPU follows pretty much the same design:
  1. The driver fills in a command buffer - that is, a block of memory with GPU commands (typically a few bytes each) that tell the GPU what to do. The GPU commands don't match the source API - there will typically be commands for draw calls, setting up registers on the GPU, and that might be it.
  2. The driver queues completed command buffers for the GPU to run in some kind of order. The GPU might DMA the command buffer into its own space, or it might read it out of system memory.
Metal exposes this directly: MTLCommandBuffer represents a single command buffer, and MTLCommandQueue is where you queue it once you're done encoding it and you want the GPU to operate on it.

It turns out a fair amount of the CPU time the driver spends goes into converting your OpenGL commands into command buffers.  Metal exposes this too via specific MTLCommandEncoder subclasses. We can now see this work directly.

When you issue OpenGL commands, the encoder is built into the context, is "discovered" via your current thread, and commands are sent to a command buffer that is allocated on the fly. (If you really push the API hard, some OpenGL implementations can block in random locations because the context's encoder can't get a command buffer.)

The OpenGL context also has access to a queue internally, and will queue your buffer when (1) it fills up or (2) you call one of glFlush/glFinish/swap. This is why your commands might not start executing until you call flush - if the buffer isn't full, OpenGL will leave it around, waiting for more commands.

One last note: the race condition between the CPU writing commands and the GPU reading them is handled by a buffer being in only one place at a time, whether it's the CPU (encoding commands) or GPU (executing them) - this is true for both Metal and GLES. So while you are encoding commands, the GPU has not started on them yet.

Normally this is not a problem - you queue up a ton of work and the GPU always has a long todo list. But in the non-ideal case where GPU latency matters (e.g. you want the answer as fast as possible), in OpenGL ES you might have to issue a flush so the GPU can start - OpenGL will then get you a new command buffer. (This is why the GL spec has all of that language about glFlush ensuring that commands will complete in finite time - until you flush, the command buffer is just sitting there waiting for the driver to add more to it.)

The GPU Does Work When You Start and End Rendering to a Surface

As I am sure you have read 1000 times, the PowerVR GPUs are tiled deferred renderers. What this means is that rasterizing and fragment shading are done on tiny 32x32 pixel tiles of the screen, one at at time. (The tile size might be different - I haven't found a good reference.) For each rendering pass, the GPU iterates on each tile of the surface and renders everything in the rendering pass that intersects that tile.

The PowerVR GPUs are designed this way so that they can function without high-speed VRAM tied to a high-bandwidth memory bus. Normal desktop GPUs use a ton of memory bandwidth, and that's a source of power consumption.  The PowerVR GPUs have a tiny amount of on-chip video memory; for each tile the surface is loaded into this cache, fully shaded (with multiple primitives) and then saved back out to shared memory (e.g. the surface itself).**

This means the driver has to understand the bounded set of drawing operations that occur for a single surface, book-ended by a start and end. The driver also has to understand the life-cycle of this rendering pass: do we need to load the surface from memory to modify it, or can we just clear it and draw? What results actually need to be saved?  (You probably need your color buffer when you're done drawing, but maybe not the depth buffer. If depth was just used for hidden surface removal, you can skip saving it to memory.) Optimizing the start and end of a surface rendering pass saves a ton of bandwidth.

Metal lets you specify how a rendering pass will work explicitly: an MTLRenderPassDescriptor describes the surfaces you will render to and exactly how you want them to be loaded and stored. You can explicitly specify that the surface be loaded from memory, cleared, or whatever is fastest; you can also explicitly store the surface, use it for an FSAA resolve, or discard it.

To get a command encoder to render (a MTLRenderCommandEncoder), you have to pass a MTLRenderPassDescriptor describing how a pass is book-ended and what surfaces are involved. You can't not answer the question.

Compare this to OpenGL ES; when you bind a new surface for drawing, the driver must note that it doesn't know how you want the pass started. It then has to track any drawing operation (which will implicitly load the surface from memory) as well as a clear operation (which will start by clearing). Lots of book-keeping.

The Entire Pipeline Is Grafted Onto Your Shader

OpenGL encourages us to think of the format of our vertex data as being part of the vertex data, because we use glVertexAttribPointer to tell OpenGL how our vertices are read from a VBO.

This view of vertex fetching is misleading; glVertexAttribPointer really wraps up two very different bits of information:

  • Where to get the raw vertex data (we need to know the VBO binding and base pointer) and
  • How to fetch and interpret that data (for which we need to know the data type, stride, and whether normalization is desired).
The trend in recent years is for GPUs to do vertex fetch "in software" as part of the vertex shader, rather than have fixed function hardware or registers that do the fetch. Moving vertex fetch to software is a win because the hardware already has to support fast streamed cached reads for compute applications, so some fixed function transistors can be thrown overboard to make room for more shader cores.

On the desktop, blending is still fixed function, but on the Power VR, blending and write-out to the framebuffer is done in the shader as well.  (For a really good explanation of why blending hasn't gone programmable on the desktop, read this.  Since the currently rendered tile is cached on chip on the PowerVR, you can see why the arguments about latency and bandwidth from desktop don't apply here, making blend-in-shader a reasonable idea.)

The sum of these two facts is: your shader actually contains a bunch of extra code, generated by the driver, on both the front and back.

Metal exposes this directly with a single object: MTLRenderPipelineState. This object wraps up the actual complete GPU pipeline with all of the "extra" stuff included that you wouldn't know about in OpenGL. Like most GPU objects, the pipeline state is immutable and is created with a separate MTLRenderPipelineDescriptor object. We can see from the descriptor that the pipeline locks down not only the vertex and fragment functions, but also the vertex format and anti-aliasing properties for rasterization. Color mask and blending is in the color attachment descriptor, so that's part of the pipeline too.

Every time you change the vertex format (or even pretend to by changing the vertex base pointer with glVertexAttribPointer), every time you change the color write mask, or change blending, you're requiring a new underlying pipeline to be built for your GLSL shader. Metal exposes the actual pipeline, allowing for greater efficiency. (In X-Plane, for example, we always tie blending state to the shader, so a pipeline is a pretty good fit.)

If there's a summary here, it's that GLES doesn't quite match the PowerVR chip, and we can see the mismatch by looking at Metal. In almost all cases, the driver has to do more work to make GLES fit the hardware, inferring and guessing the semantics of our application.

I'll do one more post in this series, looking at Mantle, and some of the terrifying things we've never had to worry about when running OpenGL on AMD's GCN architecture.

* Technically the real API objects are all ObjC protocols, while the lighter-weight struct-like entities are objects. I'll call them all objects here - to client code they might as well be. The fact that API-created objects are protocols stops you from trying to alloc/init them.

** Besides saving bus bandwidth, this technique also saves shading ops. Because the renderer ha access to the entire rendering pass before it fills in a tile, it can re-order opaque triangles for perfect front-to-back rendering, leveraging early Z rejection.

Tuesday, April 14, 2015

The OpenGL Impedance Mismatch

As graphics hardware has changed from a fixed function graphics pipeline to a general purpose parallel computing architecture, mid-level graphic APIs like OpenGL don't fit the execution model of the actual hardware as well as they used to.

In my previous post, I said that the execution of GL state change is deferred so that the driver can figure out what you'r really trying to do and efficiently change all state at once.

This has been true for a while. For example, older fixed function and partly programmable GPUs might have one set of register state to control the entire fixed-function raster operations.  Here's the R300 (e.g. the Radeon 9700).

  • The blend function and sources share a single register, but
  • The alpha and RGB blend function/sources are in different registers (meaning a single glBlendFuncSeparate partly updates both).
  • Alpha-blend enable shares a register with the flag to separate the blender functions. (Why the hardware doesn't just always run separate and let the driver update both sides of the blender is a mystery to me.)
  • Some GL state actually matches the register (e.g. the clear color is its own register).
So the match-up between imaginary ideal GL pipeline and the hardware isn't perfect. But in the end, the fit is actually pretty good:

  • Fixed function tricks like blending and stenciling are enabled by setting registers on the GPU.
  • Uniforms for a given shader live on the chip while the shader is executing.
  • The vertex fetcher is fixed functionality that is set up by register.
There's a lot written about AMD's Graphics Core Next (GCN) architecture, the GPU inside the Radeon 7900 and friends.  Since GCN GPUs are in both the X-Box One and Playstation 4 and AMD is reasonably loose with chip documentation and disassembling compilers, we know a lot about how the hardware really works.  And the not so snug.

  • Shader constants come from memory (this has been true for a while now) - this is a good fit for a UBOs but a bad fit for "loose uniforms" that are tied to the shader object.  On the GPU, the shader object and uniforms are fully separable.
  • Vertex fetch is entirely in the shader - the driver writes a pre-amble for you.  Thus changing the vertex alignment format (but not the base address) is a shader edit!  Ouch.
  • For shaders that write to multiple render targets, OpenGL lets us remap them via glDrawBuffers, but this export mapping is part of the fragment shader, so that's a shader edit too.
Those shader edits are particularly scary - this is a case where we (the app) think we're doing something orthogonal to the shading pipeline (e.g. just setting up a new VBO) but in practice, we're getting a full shader change.

In fact, the impedance match makes this even worse: if we're going to have any hope of changing state quickly, the driver has to track past combinations of vertex layout, MRT indirection, and the actual GLSL linked program, and cache the "real" shader that backs this combined state.  Each time we change the front-end vertex fetch format or back-end MRT layout, the driver has to go see if that combination exists in cache.

The back-end MRT layout isn't the worst problem because we are hopefully not going to change rendering targets that frequently.  But the vertex format is a real mess; every call to glVertexAttribPointer potentially invalidates the vertex layout; the driver can either try to heavily check state change, or regenerate the shader front-end; both options stink.

You can see OpenGL trying to track the moving target of the hardware in the extensions: GL_ARB_vertex_array_object was made part of core OpenGL 3.0 and ties up the entire vertex fetch plus base pointer in a single "object" for quick recall.  But we can see that this is now a pretty poor fit; half of the state that the VAO covers (the layout) is really part of the shader, while the other half (the actual address of the VBO plus offset) is separate.*

A newer extension, GL_ARB_vertex_attrib_binding, separates the vertex format (which is part of the shader in hardware) from the actual data location; it was made part of OpenGL 4.3. I don't know how good of a fit this is; the vertex attribute binding leaves the data stride out of the "expensive" format binding.  (My guess is that the intended implementation is to specify the data stride as a constant in a constant buffer somewhere.) In theory with this extension, only glVertexAttribFormat requires an expensive shader patch, and applications can change VBO sources without calling it.

If there's an executive summary here, it's that OpenGL as an API has never been a perfect representation of what the hardware is doing, but as the hardware moves toward general purpose compute devices that work on buffers of memory, the pipeline-and-state model fits less and less.

In my next posts I'll take a look at Metal and Mantle - these new APIs let us take the red pill and see how deep the rabbit hole goes.

* I am of the opinion that VAOs were a mistake from day one.  VAOs are mutable to allow them to be 'layered' on top of existing code the way VBOs were, and even if they weren't, the data location of the VBO is mutable at the driver level (because the VBO may at the time of draw be in VRAM or system memory, and may require a change to the memory map of the CPU that the GPU holds to draw, or it may require a DMA copy to move it to RAM).  The result is that binding a VAO doesn't let you skip the tons of validation and synchronization needed to actually start drawing once the base pointers have been moved.

OpenGL State Change Is Deferred

This is totally obvious to developers who have been coding high performance OpenGL for years, but it might not be obvious to newer developers starting with OpenGL or OpenGL ES, so...

In pretty much any production OpenGL driver, the real 'work' of OpenGL state change is deferred - that work is executed on the next draw call (e.g. glDrawElements or glDrawArrays).

This is why, when you profile your code, glBindBuffer and glVertexPointer appear to be "really fast" and yet "glDrawArrays" is using a ton of CPU Time.

The work of setting up the hardware for GL state is deferred because often the state cannot be set up until multiple calls come in.

Let's take as an example, vertex format.  You do this:
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 32, (char *) 0);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 32, (char *) 12);
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, 32, (char *) 24);
The way this is implemented on modern GPUs is to generate a subroutine or pre-amble code for your vertex shader that "executes" the vertex fetch based on these stride rules.

There's no point in generating the shader until all of the vertex format is known; if the driver went and patched your shader after the first call (setting up attribute 0) using the old state of attributes 1 and 2, all of that work is wasted and would be redone when the next two glVertexAttribPointer calls come in.

Furthermore, the driver doesn't know when you're done.  There is no glDoneScrewingAroundWithVertexAttribPointer call.

So the driver does the next best thing - it waits for a draw call.  At that point it goes "hey, I know you are done changing state because this draw call uses what you have set now."  At that point it goes and makes any state change that is needed since the last draw call.

What this means is that you can't tell how "expensive" your state change is by profiling the code doing the state change.  The cost of state change when you call it is the cost of recording for later what needs to be done, e.g.
void glBlendFunc(GLenum sfactor, GLenum dfactor)
   context * c = internal_get_thread_gl_context();
   c->blend.sfactor = sfactor;
   c->blend.dfactor = dfactor;
   c->dirty_bits |= bit_blend_mode;
In other words, the driver is just going to record what you said to the current context and make a note that we're "out of sync" state-wise.  The draw call does the heavy lifting:
void glDrawArrays(GLenum mode, GLint first, GLsizei count)
   context * c = internal_get_thread_gl_context();
   if(c->dirty_bits & bit_blend_mode)
     /* this is possibly slow */
   /* more check and sync */
   c->dirty_bits = 0;
   /* do actual drawing work - this isn't too slow */
On Apple's OpenGL implementation, the stack is broken into multiple parts in multiple dylibs, which means an Instruments trace often shows you subroutines with semi-readable names; you can see draw calls updating and synchronizing state.  On Windows the GL stack is monolithic, stripped, and often has no back-trace info, which makes it hard to tell where the CPU is spending time.

One final note: the GL driver isn't trying to save you from your own stupidity.  If you do this:
for(int i = 0; i < 1000; ++i)
   glDrawArrays(GL_TRIANGLES, i*12, 12);
Then every call to glEnable is likely to make the blend state 'dirty' and every call to glDrawArrays is going to spend time re-syncing blend state on the hardware.

Avoid calling state changes that aren't needed even if they appear cheap in their individual function call time - they may be "dirtying" your context and driving up the cost of your draw calls.

Tuesday, March 17, 2015

Accumulation to Improve Small-Batch Drawing

I sometimes see "casual" OpenGL ES developers (e.g. users making 2-d games and other less performance intensive GL applications) hit a performance wall on the CPU side. It starts with the app having something like this:
class gl_helper {
  void draw_colored_triangle_2d(color_t color, int x1, int y1,
    int x2, int y2, int x3, int y3);
  void draw_textured_triangle_2d(color_t color,
    int x1, int y1, int x2, int y2,
    int x3, int y3,
    int tex_x, int tex_y, int tex_width, int tex_height);
  void draw_textured_triangle_3d(color_t color,
    int x1, int y1, int z1, int x2, int y2, int z2,
    int x3, int y3, int z3, int tex_x,
    int tex_y, int tex_width, int tex_height);
You get the idea.  OpenGL ES is "tamed" by making simple functions that do what we want - one primitive at a time.

The results are correct drawing - and truly awful performance.

Why This Is Slow

Why is the above code almost guaranteed to produce slow results when implemented naively? The answer is that 3-d graphics hardware has a high CPU cost to set the GPU up to draw and a very low cost per triangle once you do draw.  So creating an API where each triangle comes in differently and thus must be individually set up maximizes the overhead and minimizes throughput.

A profile of this kind of code will show a ton of time in the actual draw call (e.g. glDrawArrays) but don't be fooled.  The time is really being spent at the beginning of glDrawArrays synchronizing the GPU with the type of drawing you want.*

Cheaper By the Dozen

The Mike Acton way of fixing this is "where there's one, there's many" - this API should allow you to draw lots of triangles, assuming they are all approximately the same.  For example,
void draw_lots_of_colored_triangles(color_t color, int count, float xyz[]); 
would not be an insane API.  At least if the number of triangles gets big, the overhead gets small.

One thing is clear: if your application can generate batched geometry, it absolutely should be sending it to OpenGL in bulk!  You never want to run a for-loop over your big pile of triangles and send them one at a time; if you have a wrapper around OpenGL, make sure you can send the data in without chopping it up first!

When You Can't Consolidate

Unfortunately there are times when you can't actually draw a ton of triangles all at once. It's cute of me to go "oh, performance is easy - just go rewrite all of your drawing code", but this is time consuming and in some cases the app structure itself might make this hard. If you can't design for bulk performance, there is a second option: accumulation.

The idea of accumulation is this: instead of actually drawing all of those individual triangles, you stash them in memory.  You do so in a format that makes it reasonably quick to both:

  1. Save the triangles (so you don't waste time saving and)
  2. Send them all to OpenGL at once.
Here's where the performance win comes from: the accumulator can see that the last 200 triangles were all color triangles with no texture, so it can send them to the GPU with one state setup (for non-textured triangles) and then a single 200-triangle draw call.  This is about 200x more efficient than the naive code.

The accumulator also gives you a place to collect statistics about your application's usage of OpenGL.  If your app is alternating colored and textured triangles, you're going to have to change shaders (even in the accumulator) and it will still be slow.  But you can record statistics in debug mode about the size of the draws to detect this kind of "inefficient ordering."

Similarly, the accumulator can eliminate some calls to the driver to setup state because it knows what it was last doing.  The accumulator does all of its drawing in one shot; if you draw two textured triangles with different textures, the accumulator must stop to change textures (not so good), but it can go "hey, another textured triangle, same pixel shader" and avoid changing pixel shaders (a big win).

Dealing With Inefficient Ordering

So now you have an accumulator, it submits the biggest possible batches of the same kinds of triangles, and it makes the minimum state change calls when the drawing type changes.  And it's still slow. When you look at your usage stats, you find the average draw call size is still only two triangles because the client code is alternating between drawing modes all of the time.

(Maybe your level's building block consists of a textured square background with an additively blended square on top, and this means two triangles of background, state change, two triangle of background, state change again.)

I am assuming that you have already combined your images into a few large textures (texture atlasing) and that you don't have a million tiny textures floating around.  If you haven't atlased your textures, go do it now; I'll wait.

Okay welcome back. When your drawing batch size is still too small even after accumulation, you have two tools to get your batch size back up.

Draw Reordering

The first trick you can try (and you should try this one first) is to give your accumulator the freedom to reorder drawing to achieve better performance.

In our example above, every square in the level had two draws, one on top of the other, and they weren't in the same OpenGL mode.  What we can do is define each draw to be in a different layer, and let the accumulator draw all of layer 0 before any of layer 1.

Once we do that, we find that all of layer 0 is in one OpenGL state (big draw) and all of layer 1 is in the other.  We've relaxed our ordering by giving the accumulator an idea of the real draw ordering we need, rather than the implicit one that comes from the order our code runs.

We actually had just this problem in X-Plane 10 Mobile's user interface; virtually every element was a textured draw of a background element (which uses a simple texturing shader) followed by a draw of text (which uses a special font shader that applies coloring from a two-channel texture).

The result was two shader changes per UI element, and the performance was awful.

We simply modified our accumulator to draw all text after all UI elements; there's a simple "barrier" that can be placed to force stored up text to be output before proceeding (to get major layering of the UI right) but most windows can draw all of their UI elements before any text, cutting down the number of shader changes to two changes total - a big win!

Merging OpenGL State

If you absolutely have to have the draw order you have (maybe there's alpha blending going on) the other lever you can pull is to find ways to make disparate OpenGL calls use more similar drawing state. (This is what texture atlasing does.)  A few tricks:

  • Use a very small solid white texture for non-textured geometry - you can now use your texturing shader at all times.
  • You don't need to get rid of color application in a shader - simply set the color to white opaque.
  • If you use pre-multiplied alpha, you can draw both additive and non-additive alpha from the same state by varying how you prepare your art assets. Opaque assets can be run with the blender on.
In most of these cases, performance is potentially being lost, so you need to be sure that the cost of the small batching and specific draw order needs outweighs the cost of not doing the most efficient thing.  The small white texture should be pretty cheap; GPUs usually have very good texture memory caches.  Blending tricks can be very expensive on mobile GPUs, and old mobile GPUs are very sensitive to the length of the pixel shader, so you only want to leave color on if it's in the vertex shader.

The point of the above paragraph is: measure carefully first, then merge state second; merging state can be a win or a loss, and it's very dependent on the particular model you're drawing.

* Most drivers defer the work of changing the GPU's mode of drawing until you actually say draw. This way it can synchronize the net result of all changing, instead of making a single change each time you call an API command.  Since the gl calls you make don't fit the hardware very well, waiting until the driver can see all changes is a big win.

Saturday, March 14, 2015

glNext is Neither OpenGL nor Next, Discuss

The title is a stretch at best, but as I have said before, good punditry has to take precedence over correctness. Khronos posted a second version of the glNextVulkan + SPIR-V talk with good audio and slices. I'll see you in an hour and a half.

That answered all of our questions, right!  Ha ha, I kid. Seriously though, at least a little bit is now known:
  • Vulkan is not an incremental extension to OpenGL - there's no API compatibility. This is a replacement.
  • Vulkan sits a lot lower in the graphics stack than OpenGL did; this is an explicit low level API that exposes a lot of the hard things drivers did that you didn't know existed.
The driver guys in the talk seem pretty upbeat, and they should: they get to do less work in the driver than they used to! And this is a good thing; the surface area of the OpenGL API (particularly when you combine ARB_compatibility with all of the latest extensions) is kafkaesque. If someone showed you the full API and said "go code that" you'd surely offer to cut off a finger as a less painful alternative.

My biases as a developer are in favor of not throwing out things that work, not assuming that things need a from scratch rewrite just because they annoy you, and not getting excited just because it's shiny and new.  So I am surprised with myself that at this point, I'd much rather have a Vulkan-like API than all of the latest OpenGL extensions, even though it's more work for me. (Remember that work the driver guys aren't going to do?  It goes into the engine layer.)

What's Good/Why Do We Need This?

While there's a lot of good things for game engines in Vulkan, there are a few that got my attention because they are not possible with further extension/upgrade to OpenGL:

Threading: A new API is needed because OpenGL is thread-unfriendly, and it's unfriendly at the core of how the API is written; you can't fix this by adding more stuff. Some things OpenGL does:
  • OpenGL sets up a 1:1 correspondence between queues, command buffers, and threads.  If you want something else, you're screwed, because you get one thing ("the context") and it has damned strict threading rules.
  • OpenGL does the thread synchronization for you, even if you don't want that.  There are locks inside the driver, and you can't get rid of them.*
With Vulkan, command buffers and queues are separate, resource management is explicit, and no synchronization is done on your behalf.

This is definitely a win for game engines. For example, with X-Plane we will load a scenery tile "in the background". We know during loading that every object involved in the scenery tile is "thread local" to us, because they have not been shared. There is no common data between the rendering thread and the loader.

Therefore both can run completely lock free.  There is a one-time synchronization when the fully finished tile is inserted into the active world; this insert happens only after the load is complete (via message Q) and is done between frames by the rendering thread.  Again, no locks.  This code can run lock free at pretty much all points.

There's no way for the GL driver to know that. Every time I go to shovel data into a VBO in OpenGL, the driver has to go "I wonder if anyone is using this?  Is this call going to blow up the world?"  Under Vulkan, the answer is "I'm the app, trust me."  That's going to go a lot faster.  We're getting rid of "safety checks" in the driver that are not needed.

Explicit Performance: one of the hardest things about developing realtime graphics with OpenGL is knowing where the "fast path" is.  The GL API lets you do about a gajillion different things, and only a few call paths are going to go fast.  Sometimes I see threads like this on OpenGL mailing lists:
Newb: hey AMD, when I set the refrigerator state to GL_FROZEN_CUSTARD and then issue a glDrawGizmo(GL_ICECREAM, 10); I see a massive performance slow-down. Your driver sucks!
I'm sitting in front of my computer going "Oh noooooes!!!  You can't use ice cream with frozen custard - that's a crazy thing to do."  Maybe I even write a blog post about it.

But how the hell does anyone ever know?  OpenGL becomes a game of "write-once performance tune everywhere" (or "write once, harass the driver guys to run your app through vtune and tell you you're an idiot everywhere") - sometimes it's not possible to tell why something is slow (NVidia, I'm looking at you and your stripped driver :-) and sometimes you just don't have time to look at every driver case (cough cough, Intel, cough).

OpenGL doesn't just have a huge API, it has a combinatorially huge API - you can combine just about anything with anything else; documenting the fast path (even if all driver providers could agree) is mathematically impossible.

Vulkan fixes this by making performance explicit.  These functions are fast, these are slow.  Don't call the slow calls when you want to go fast.  It gives app developers a huge insight into what is expensive for the driver/hardware and what is not.

Shim It: I may do a 180 on this when I have to code it, but I think it may be easier to move legacy OpenGL apps to Vulkan specifically because it is not the OpenGL API.

When we had to port X-Plane 9 for iPhone from GLES 1.1 to GLES 2.0, I wrote a shim that emulated the stuff we needed in GLES 1.1  Some of this is now core to our engine (e.g. the transform stack) and some still exists because it is only used in crufty non-critical path code and it's not worth it to rip it out (e.g. glBegin).

The shimming exercise was not that hard, but it was made more complicated by the fact that half of the GL API is actually implemented in both versions of the spec.  I ended up doing some evil macro trickery: glDrawElements gets #defined over to our internal call, which updates the lazily changed transform stack and then calls the real glDrawElements.  Doing this level of shim with the full desktop GL API would have been quite scary I think.

Because Vulkan isn't gl at all, one option is to simply implement OpenGL using...Vulkan. I will be curious if a portable open source  gl layer emerges; if it does, it would be a useful way for very large legacy code bases to move to Vulkan.  There'd be two wins:

  1. Reliability.  That's a lot less code that comes from the driver; whether your gl layer works right or is buggy as a bed in New York, it's going to be the same bugs everywhere - if you've ever tried to ship a complicated cross-platform OpenGL app, having the same bugs everywhere is like Christmas (or so I'm told).
  2. Incremental move to Vulkan.  Once you are running on a GL shim, poke a hole through it when you need to get to the metal for only performance-critical stuff.  (This is what we did with GLES 1.1/2.0: the entire UI ran in GLES 1.1 emulation and the rendering engine went in and bound its own custom shaders.)

Vulkan is Not For Everyone

When "OpenGL is Borked" went around the blogs last year one thing that struck me was how many different constituencies were grumpy about OpenGL, often wanting fixes that could not co-exist. Vulkan resolves this tension: it's low level, it's explicit, it's not backward compatible, and therefore it's only for developers who want to do more work to get more perf and don't need to run on everything or can shim their old code.

I think this is a good thing: at least Vulkan can do the limited task it tries to do well. But it's clearly not for beginners, not for teaching an introduction to 3-d graphics, and if you were grumpy about how much work it was to use GLES 2.0 for your mobile game, Vulkan's not going to make you very happy.  And if you're sitting on 100,000,000 lines of CAD code that's all written in OpenGL, Vulkan doesn't do you as much good as that one extension you really really really need.

For developers like me (full time, professional, small company, proprietary engine) there's definitely going to be a cost in moving to Vulkan in development time. Whenever the driver guys talk about resource management they often say something like:
The app has to do explicit resource management, which it's probably already doing on console.
for the big game engines this is totally true, so being able to re-use their resource management code is a win. For smaller games, OpenGL is their resouce management code.  It's not necessarily very good resource management (in that the GL driver is basically guessing about you want, and sometimes guessing wrong) but if you have a three-person development team, having Graham Sellers write your resource management code for you for free is sort of an epic win.

Resource management is the one area where what we know now is way too fuzzy. You can look at Apple's Metal API (fully public, shipping, code samples) and see what a world with non-mutable objects, command queues and command buffers looks like. But resource management in Metal is super simple because it only runs on a shared memory device: a buffer object is a ptr to memory, full stop.  (Would if it were that easy on all GPUs.)

It's too soon to tell what the "boiler plate" will look like for handling resource management in Vulkan.  There's a huge difference in the quality of resource management between different driver stacks; writing a resource manager that does as well as AMD or NVidia's is going to be a real challenge for small development teams.

* My understanding is that if you create only one GL context (and thus you are not using threads) the driver will actually run in a lock-free mode to avoid over-head.  The fact that the driver bothers to detect that and special case it gives you some idea how crazy a GL driver is.  If that doesn't, read this.