Saturday, November 21, 2015

Blender Notepad - Eulers

When Blender describes a rotation as an 'XYZ' Euler (with 3 angles), this is what they mean:
  • The Z axis is "up" (the Y axis is away from us to be right-handed).
  • Each rotation is around the named axis.  So X is a rotation around the X axis (a "pitching up" rotation for pilots).
  • The rotations are done in the order listed, extrinsically. In other words, we rotate around each of these global axes.
The net result of this is that the X rotation is affected by the Y and Z (because they happen later).  If we were rotating around the rotated Y (Y') or rotated Z (Z'') axis, then the X axis would be unaffected.
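As a rough illustration of the ordering described above, here is a minimal C++ sketch that applies X, then Y, then Z, each about the fixed global axis. The names `Vec3`, `rotate_global` and `euler_xyz_extrinsic` are made up for this example and are not Blender API.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec3 = std::array<double, 3>;

// Rotate v about one fixed (global) axis: 0 = X, 1 = Y, 2 = Z. Right-handed.
Vec3 rotate_global(const Vec3& v, int axis, double radians) {
    assert(axis >= 0 && axis < 3);
    const double c = std::cos(radians), s = std::sin(radians);
    const int a = (axis + 1) % 3, b = (axis + 2) % 3;  // the two components that change
    Vec3 r = v;
    r[a] = c * v[a] - s * v[b];
    r[b] = s * v[a] + c * v[b];
    return r;
}

// 'XYZ' Euler as described above: X first, then Y, then Z, all about global axes.
Vec3 euler_xyz_extrinsic(const Vec3& v, double x, double y, double z) {
    return rotate_global(rotate_global(rotate_global(v, 0, x), 1, y), 2, z);
}
```

With X = 90 degrees and Z = 90 degrees, the vector (0, 1, 0) is first pitched up onto the Z axis and then left alone by the Z rotation. Swapping the order of the two rotations would instead send it to (-1, 0, 0), which is why the order matters.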

The net result is that (from an aviation-angles perspective) we do yaw first (global Z is unaffected), then roll (transformed Y), then pitch (transformed X).  (It should be noted that with pitch last, this does not even remotely correspond to how pilots think about these angles.)

To match X-Plane's transform, instead of XYZ, we need (in Blender) YXZ, which puts Y (roll) at lowest priority.

How the 2.49 Exporter Goes to Crazy Town

Blender 2.75 lets you select orientations; 2.49 is always in XYZ mode.  Since these are global axes, the correct order to apply them in an OBJ is:
ANIM_rotate 0 0 1
ANIM_rotate 0 1 0
ANIM_rotate 1 0 0
That is, apply Z first, since X-Plane only has local transforms.  (That is, in X-Plane, the last animation is affected by the prior two.)

When the Blender 2.49 exporter decomposes rotations into Eulers, it goes in this order, but it does so in X-Plane coordinates.  Thus while "yaw" is unchanged in XYZ animation in Blender, "roll" is unchanged in the export.

Friday, November 20, 2015

SASL Crash on El Capitan - the Gory Details

I'm trying to not clog up the X-Plane developer blog with tons of technical C++ details. There are a small number of developers who actually want to know those details, so I'm going to post them here. This post explains why SASL was crashing on plugin-unload on El Capitan (but not older operating systems).

Both SASL and Apple's OpenAL implementation are open source, so despite this being a bug that was totally not in the X-Plane code base, I was able to look at everyone involved and debug it myself. I am not particularly happy about having to do that, but the symptoms of the bug were:
  • Upgrade to El Capitan for free - why not, new things are shiny.
  • Run X-Plane - seems okay!
  • Run SASL plane - seems okay!
  • Switch from SASL plane to plane that ships with X-Plane. Oh noes - my sim crashed! Report a bug to Laminar Research.
The back-traces from the Apple crash reports were all very clear: X-Plane was unloading SASL, SASL was asking OpenAL to tear down its audio context, and OpenAL was throwing an uncaught exception.

So I got involved because users thought this was our bug, even though it wasn't.

Hrm - new crash in Apple's framework in a new OS. Blame Apple! Except, no other OpenAL code is crashing.

Apple's Bug

It turns out there is a bug in Apple's OpenAL. It's one that has been in there for a long time, but only shows up in El Capitan, and frankly doesn't matter in any real way. On OS X, if you call alcDestroyContext on a context that (1) has playing sounds, (2) is the only context for its device and (3) isn't using effects on those sounds, then you get an uncaught exception on El Capitan.

The actual bug is subtle - the tear-down order of the underlying audio units that power Apple's OpenAL implementation isn't quite right in this case, resulting in AudioUnits returning an error code in a destructor.  The code throws this and catches it in the underlying alcDestroyContext call.

From what I can tell, there was a tool chain change in El Capitan that causes this to terminate an app. I am not an expert, but I think that throwing an exception out of a destructor is undefined behavior, and now Clang is putting its foot down. When I compiled OpenAL from source, my built version simply caught the exception and returned it from alcDestroyContext.
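One language rule that may be relevant here: since C++11, destructors are implicitly noexcept, so an exception that escapes one calls std::terminate unless the destructor is declared noexcept(false). Whether that is what actually changed in Apple's build is only a guess. The sketch below just shows the rule, using made-up types.

```cpp
#include <cassert>
#include <stdexcept>
#include <type_traits>

// Since C++11 a destructor like this is implicitly noexcept:
// an exception escaping it would call std::terminate.
struct DefaultDtor {
    ~DefaultDtor() {}
};

// Opting out with noexcept(false) lets the exception propagate to a caller.
struct ThrowingDtor {
    ~ThrowingDtor() noexcept(false) { throw std::runtime_error("teardown failed"); }
};

static_assert(std::is_nothrow_destructible<DefaultDtor>::value, "implicitly noexcept");
static_assert(!std::is_nothrow_destructible<ThrowingDtor>::value, "declared as possibly throwing");

// Returns true if the exception thrown by the destructor reached the catch block.
bool teardown_error_was_caught() {
    try {
        ThrowingDtor t;  // destroyed at the closing brace, where the destructor throws
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

The same throw from a destructor without noexcept(false), compiled as C++11 or later, would end the process instead of reaching the catch block.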

For what it's worth, I don't consider this a severe bug or engineering failure by Apple. The OpenAL specification is a total disaster, and I don't blame anyone who misses a corner case (assuming deleting a playing context even is legal - with a spec like that, who knows). And no app in its right mind would just go kill the context without stopping audio first. Which brings us to SASL's bug.

SASL's Bug

SASL had a bug too. SASL uses a stack based C++ class to change the OpenAL audio context from X-Plane's context to its own to do audio work and then turn it back when done. This is a classic RAII way to manage state.
ContextChanger changer(sound->context);
Except the clean-up code in SASL had this:
ContextChanger(sound->context);
That is, of course, totally legal C++, and totally not useful. I look forward to the day when creating a temporary object in its own expression with a non-trivial destructor is a warning, because I've done this in my own code too.
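To make the difference concrete, here is a small self-contained sketch. It uses a global integer as a stand-in for the current OpenAL context; `ScopedContext`, `Sound` and both functions are hypothetical names for this example, not SASL code.

```cpp
#include <cassert>

// Hypothetical stand-in for "the current OpenAL context": a plain global id.
static int g_current_context = 0;

// RAII guard in the spirit of ContextChanger: switch on construction, restore on destruction.
class ScopedContext {
public:
    explicit ScopedContext(int context) : old_(g_current_context) { g_current_context = context; }
    ~ScopedContext() { g_current_context = old_; }
    ScopedContext(const ScopedContext&) = delete;
    ScopedContext& operator=(const ScopedContext&) = delete;
private:
    int old_;
};

struct Sound { int context; };

// Named guard: it lives until the end of the function, so the work sees s.context.
int context_seen_with_named_guard(const Sound& s) {
    ScopedContext changer(s.context);
    assert(g_current_context == s.context);
    return g_current_context;
}

// Unnamed temporary: it is destroyed at the end of its own statement,
// so the old context is already back by the time the work runs.
int context_seen_with_temporary(const Sound& s) {
    ScopedContext(s.context);
    return g_current_context;
}
```

If the current context is 1 and the sound's context is 7, the first function returns 7 and the second returns 1.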

Without a working context changer, SASL's cleanup code would attempt to clean up all of X-Plane's audio objects (not cool man, not cool!) and then kill its own context. Of course, its own context was still playing since no cleanup had happened.

To put it bluntly, this bug makes me pretty mad, and here's why:
  • This code has literally never worked right. Not once, not since day one.
  • The fact that this code was not working right was easily detectable just by checking the OpenAL error code. When SASL goes to delete sources in the wrong context, in most cases the source names are wrong and OpenAL returns an error code. During development and in debug mode, SASL should be checking the OpenAL error code, at least when it finishes its own work before returning control to X-Plane.
Unfortunately, before this bug was fixed, SASL contained only one bit of "error checking code":
ContextChanger(ALCcontext *context) {oldContext = alcGetCurrentContext();alcMakeContextCurrent(context);alGetError();};
If you don't speak OpenAL, basically that's SASL clearing the error code before beginning audio work, with no check of what is in there. This is not how to do error checking.
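For contrast, here is a minimal sketch of what checking (rather than clearing) a sticky error code can look like. The fake API below is a stand-in so the example runs without OpenAL; in real code the getter passed in would be something like alGetError, and the error value 1 is an arbitrary placeholder.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for an API with a sticky error code:
// reading the code also clears it, and 0 means "no error".
static int g_fake_error = 0;
int fake_get_error() { const int e = g_fake_error; g_fake_error = 0; return e; }
void fake_delete_source(bool name_is_valid) { if (!name_is_valid) g_fake_error = 1; }

// Debug check: report which step failed instead of silently discarding the code.
// Returns true when the API reported no error.
bool check_api_error(int (*get_error)(), const char* what) {
    assert(get_error != nullptr);
    const int err = get_error();
    if (err != 0)
        std::fprintf(stderr, "API error 0x%X after %s\n", static_cast<unsigned>(err), what);
    return err == 0;
}
```

Calling the check once before handing control back to the host, and asserting on its result in debug builds, would be enough to flag a delete issued against the wrong context.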

The Fix Is In

The good news is that the newest version of SASL (2.4 as of this writing) fixes the context changer bug, and also in some cases checks the OpenAL error code after issuing OpenAL commands. The error checking is not as complete as I'd like to see, and still will silence the error sometimes, but it's a step in the right direction.

Are there any teachable moments here? I think there are a few:
  • If an API provides return codes* for the purpose of determining program correctness (e.g. OpenAL returning "invalid source") it is absolutely essential to leverage those return codes to do debug assertion checking.
  • It is not good enough to run the code and observe expected behavior at the user level - you need to verify that the code is actually doing what you expect, or you don't know. (A very wise senior engineer once told that to me 21 (!) years ago when I was just an intern at Avid; it's taken me about that long to deeply understand this in my gut.)
  • Any time the behavior of code isn't going to be directly user observable (which includes pretty much all resource cleanup code), you need to design the system for debug-ability, e.g. create test cases, attach the debugger, put logging in place, put assertions in place. Proving a program is correct and debugging it is a design requirement just like functionality.

* I don't want to use the term error codes for these returns because I think it is important to distinguish between mistakes in program correctness (you, the programmer, screwed up) and expected failures of hardware (e.g. a disk read error). Having a return enumeration from a function is a coding idiom that can be used for either of these cases. In the case of OpenAL and OpenGL, the returned code detects both programmer mistakes and underlying "errors", e.g. exhaustion of memory.

Thursday, June 18, 2015

glMapBuffer No Longer Cool

TL;DR: when streaming uniforms, glMapBuffer is not a great idea; glBufferSubData may actually work well in some cases.

I just fixed a nasty performance bug in X-Plane, and what I found goes directly against stuff I posted here, so I figured a new post might be in order.

A long time ago I more or less wrote this:

  • When you want to stream new data into a VBO, you need to either orphan it (e.g. get a new buffer) or use the new (at the time) unsynchronized mapping primitives and manage ranges of the buffer yourself.
  • If you don't do one of these two things, you'll block your thread waiting for the GPU to be done with the data that was being used before.
  • glBufferSubData can't do any better, and is probably going to do worse.
Five years is a long time in GPU history, and those rules don't quite apply.

Everything about not blocking on the GPU with map buffer is still true - if you do a synchronized map buffer, you're going to block hard.  Never do that!

But...these days on Windows, the OpenGL driver is running in a separate thread from your app. When you issue commands, it just marshals them into a FIFO as fast as it can and returns. The idea is to keep the app rendering time and driver command buffer assembly from being sequential.

The first problem is: glMapBuffer has to return an actual buffer pointer to you! Since your thread isn't actually doing real work, this means one of two things:

  1. Blocking the app thread until the driver actually services the requests, then returning the result. This is bad. I saw some slides a while back where NVidia said that this is what happens in real life.
  2. In theory under just the right magic conditions glMapBuffer could return scratch memory for use later. It's possible under the API if a bunch of stuff goes well, but I wouldn't count on it. For streaming to AGP memory, where the whole point was to get the real VBO, this would be fail.
It should also be noted at this point that, at high frequency, glMapBuffer isn't that fast. We still push some data into the driver via client arrays (I know, right?) because when measuring unsynchronized glMapBufferRange vs just using client arrays and letting the driver memcpy, the latter was never slower and in some cases much faster.*

Can glBufferSubData Do Better?

Here's what surprised me: in at least one case, glBufferSubData is actually pretty fast. How is this possible?

A naive implementation of glBufferSubData might look like this:
void glBufferSubData(GLenum target, GLintptr offset, GLsizeiptr size, const GLvoid * data)
{
    GLvoid * ptr = glMapBuffer(target, GL_WRITE_ONLY);   // synchronized map: may stall on the GPU
    memcpy((char *) ptr + offset, data, size);
    glUnmapBuffer(target);
}
The synchronized map buffer up top is what gets you a stall on the GPU, the thing I was suggesting is "really really bad" five years ago.

But what if we want to be a little bit more aggressive?
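void glBufferSubData(GLenum target, GLintptr offset, GLsizeiptr size, const GLvoid * data)
{
    if(offset == 0 && size == size_of_currently_bound_vbo)
        glBufferData(target, size, NULL, usage_of_currently_bound_vbo);   // orphan; the usage name is a placeholder, like the size
    GLvoid * ptr = glMapBuffer(target, GL_WRITE_ONLY);
    memcpy((char *) ptr + offset, data, size);
    glUnmapBuffer(target);
}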
In this case, we have, in the special case of completely replacing the VBO, removed the block on the GPU. We know it's safe to simply orphan and splat.

What's interesting about this code is that the API to glBufferSubData is one-way - nothing is returned, so the code above can run in the driver thread, and the inputs to glBufferSubData can easily be marshaled for later use.  By keeping the results of glMapBuffer private, we can avoid a stall.

(We have eaten a second memcpy - one to marshal and one to actually blit into the real buffer. So this isn't great for huge amounts of data.)

Anyway, from what I can tell, the latest shipping drivers from NVidia, AMD and Intel all do this - there is no penalty for doing a full glBufferSubData, and in the case of NVidia, it goes significantly faster than orphan+map.

A glBufferSubData update like this is sometimes referred to as "in-band" - it can happen either by the driver queuing a DMA to get the data into place just in time (in-band in the commands stream) or by simply renaming the resource (that is, using separate memory for each version of it).
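The renaming idea can be modeled on the CPU side in a few lines. This is only a toy sketch: `RenamedBuffer` and its methods are invented names, and a real driver tracks GPU fences rather than shared_ptr reference counts.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

using Storage = std::vector<unsigned char>;

// Hypothetical model of a driver-side buffer object that is "renamed" on full updates.
class RenamedBuffer {
public:
    explicit RenamedBuffer(std::size_t size) : storage_(std::make_shared<Storage>(size)) {}

    // A queued draw keeps a reference to whichever version was current when it was issued.
    std::shared_ptr<const Storage> snapshot_for_draw() const { return storage_; }

    // Full replacement: allocate a fresh version instead of waiting for pending
    // draws to finish with the old one. The old storage goes away when the last
    // draw holding it is done.
    void replace_all(const Storage& data) {
        assert(data.size() == storage_->size());
        storage_ = std::make_shared<Storage>(data);
    }

private:
    std::shared_ptr<Storage> storage_;
};
```

A draw issued before a full update keeps reading the old bytes while later draws see the new ones, so nothing has to block.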

Using glBufferSubData on Uniforms

The test case I was looking at was with uniform buffer objects.  Streaming uniforms are a brutal case:

  • A very small amount of data is going to get updated nearly every draw call - the speed at which we update our uniforms basically determines our draw call rate, once we avoid knuckle-headed stuff like changing shaders a lot.
  • Loose uniforms perform quite well on Windows - but it's still a lot of API traffic to update uniforms a few bytes at a time.
  • glMapBuffer is almost certainly too expensive for this case.
We have a few options to try to get faster uniform updates:

  1. glBufferSubData does appear to be viable. In very very limited test cases it looks the same or slightly faster than loose uniforms for small numbers of uniforms. I don't have a really industrial test case yet. (This is streaming - we'd expect a real win when we can identify static uniforms and not stream them at all.)
  2. If we can afford to pre-build our UBO to cover multiple draw calls, this is potentially a big win, because we don't have to worry about small-batch updates. But this also implies a second pass in app-land or queuing OpenGL work.**
  3. Another option is to stash the data in attributes instead of uniforms. Is this any better than loose uniforms? It depends on the driver.  On OS X attributes beat loose uniforms by about 2x.
Toward this last point, my understanding is that some drivers need to allocate registers in your shaders for all attributes, so moving high-frequency uniforms to attributes increases register pressure. This makes it a poor fit for low-frequency uniforms. We use attributes-as-uniforms in X-Plane for a very small number of parameters where it's useful to be able to change them at a frequency close to the draw call count.
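For the pre-built UBO option (the second one in the list above), the app-side work is mostly packing one uniform block per draw call at a legal offset. Here is a minimal sketch of that packing step; `DrawUniforms` is a hypothetical layout, and the alignment is assumed to be queried from GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT. The upload and the per-draw glBindBufferRange calls are left out so the example runs without a GL context.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Round n up to a multiple of alignment (alignment must be a power of two).
std::size_t align_up(std::size_t n, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (n + alignment - 1) & ~(alignment - 1);
}

// Hypothetical per-draw uniform block; the real layout depends on the shader.
struct DrawUniforms {
    float model_view[16];
    float tint[4];
};

// Pack one block per draw call into a single CPU-side staging buffer.
// offsets[i] receives the byte offset of draw i's block within the result.
std::vector<unsigned char> pack_uniforms(const std::vector<DrawUniforms>& draws,
                                         std::size_t offset_alignment,
                                         std::vector<std::size_t>& offsets) {
    const std::size_t stride = align_up(sizeof(DrawUniforms), offset_alignment);
    std::vector<unsigned char> staging(stride * draws.size());
    offsets.clear();
    for (std::size_t i = 0; i < draws.size(); ++i) {
        offsets.push_back(i * stride);
        std::memcpy(staging.data() + i * stride, &draws[i], sizeof(DrawUniforms));
    }
    return staging;
}
```

The staging buffer would then go to the driver in one update, and each draw would bind its own range using the recorded offset. The padding between blocks is the price of the alignment rule.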

I'm working on a comprehensive test engine now to assess performance on every driver stack I have access to. When I have complete data, I'll write up a post.

* The one case that is pathological is the AMD Catalyst 13-9 drivers - the last ones that support pre-DX11 cards. On those cards, there is no caching of buffer mappings, so using map buffer at high frequency is unshippable.  The current AMD glMapBuffer implementation for DX11 cards appears to have similar overhead to NVidia's.

** This is a case we can avoid in the next-gen APIs; since command buffers are explicitly enqueued, we can leave our UBO open and stream data into it as we write the command buffer, and know that we won't get flushed early.  OpenGL's implicit flush makes this impossible.

Wednesday, June 10, 2015

OS X Metal - Raw Notes

Just a few raw notes about Metal as I go through the 2015 WWDC talks. This mostly discusses what has been added to Metal to make it desktop-ready.

Metal for desktop has instancing, sane constant buffers, texture barrier, occlusion query, and draw-indirect. It looks like it does not have transform feedback, geometry shaders or tessellation. (The docs do mention outputting vertices to a buffer with a nil fragment function, but I don't see a way to specify the output buffer for vertex transform. I also don't see any function attach points for geometry shaders or tessellation shaders.)

The memory model is "common use cases" - that is, they give you a few choices:
  • Shared - one copy of the data in AGP memory (streaming/dynamic in OpenGL). Coherency is apparently at the command buffer, not continuous, so this apparently does not rely on map-coherent (and I would assume has lower overhead).
  • Managed - what we have now for static geometry: a CPU side and GPU side copy that are synced. Sync is explicit by flushing CPU side changes (via didModifyRange).  Reading back changes from the GPU is explicit -and- queued via a synchronizeResource call on the blit command encoder. For shared memory devices (e.g. Intel?) there's only one copy (e.g. this backs off to "shared").
  • Private - the memory is entirely on the GPU side (possibly in VRAM) - the win here is that the format can be tiled/swizzled/whatever is fastest for the GPU. Therefore this is the right storage option for framebuffers.  Access to/from the data is only from blit command encoder operations.
  • Auto - a meta-format for textures - turns into shared on iOS (which has no managed) and managed on desktop - so that cross-platform code can do one thing everywhere. (This seems odd to me, because desktop apps will want to have some meshes be managed too.)
The caching model (e.g. write-combined) is a separate flag on buffer objects. Mapping is always available and persistent.

There appears to be no access to the parallel command queues that modern GCN2 devices have. Modern GPUs can run blit/DMA operations in parallel with rendering operations; this uses the hardware more efficiently but also introduces an astonishing amount of complexity into APIs like Mantle that tell developers "here's two async queues - good luck staying coherent."

Perhaps Metal internally offloads blit command buffers to the DMA queue and inserts a wait in the rendering encoder for the bad-luck case where the blit doesn't finish enough ahead of time.

Unlike Mantle, there is no requirement to manually manage the reference pool for the resources that a command queue has access to - this also simplifies things.

Finally, Mantle has a more complex memory model; in Mantle, you get big pools of memory from the driver and then jam resources into them yourself, letting games create pool allocators as desired. In Metal, everything's just a resource; you don't really know how much VRAM you have access to or how stuffed it is, and paging the managed pool is entirely within the driver.  (One exception: you can create a buffer directly off of a VM page with no copy, but this still isn't the same as what Mantle gives you.)

As the guy who would have to code to these APIs, I definitely like Apple's simpler, more automatic model a lot more than the state change rules for Mantle; naively (having not coded it) it seems like the app logic to handle state change would be either very complex and finicky or non-optimal. But I'd have to know what the cost in performance is of letting the driver keep these tasks. Apple's showing huge performance wins over OpenGL, but that doesn't validate the idea of driver-managed resource coherency; you'd have to compare to Mantle to see who should own the task.

I can imagine AAA game developers being annoyed that there isn't a pooling abstraction like in Mantle, since the "pool of memory you subdivide" model works well with what console games do on their own to keep memory use under control.

Overall, based on what I've read of the API, Metal looks like a solid, well-thought-out next generation API; similar to Mantle in how it reorganizes work-flow, but less complex. It's still missing some modern desktop GPU functionality, but moving Metal to discrete hardware with discrete memory hasn't turned the API into a swamp.

Finally, from an adoption stand-point: it looks to me that Metal on OS X gives Apple a way to try to leverage its strong position in mobile gaming to move titles to the desktop, which is a more viable sell than trying to use a more modern OpenGL to move titles from PC. (Only having a DirectX clone would help with that.)

Friday, May 22, 2015

Understanding PowerVR GPUs via Metal

In my previous post I suggested that OpenGL and OpenGL ES, as APIs, don't always fit the underlying hardware. One way to understand this is to read GPU hardware documentation - AMD is pretty good about posting hardware specs, e.g. ISAs, register listings, etc. You can also read the extensions and see the IHV trying to bend the API to be closer to the hardware (see NVidia's big pile of bindless this and bindless that). But both these ways of "studying" the hardware are time consuming and not practical if you don't do 3-d graphics full time.

Recently there has been a flood of new low-level, close-to-the-hardware APIs: Metal (Apple, PowerVR), Mantle (AMD, GCN), Vulkan (Khronos, everything), DirectX 12 (Microsoft, desktop GPUs). This provides us another way of understanding the hardware: we can look at what the graphics API would look like if it were rewritten to match today's hardware.

Let's take a look at some Metal APIs and see what they tell us about the PowerVR hardware.

Mutability Is Expensive

A texture in Metal is referenced via an MTLTexture object.* Note that while it has properties to get its dimensions, there is no API to change its size! Instead you have to fill in a new MTLTextureDescriptor and use that to make a brand new MTLTexture object.

In graphics terms, the texture is immutable. You can change the contents of its image, but you can't change the object itself in such a way that the underlying hardware resources and shader instructions associated with the texture have to be altered.

This is a win for the driver: when you go to use an MTLTexture, whatever was true about the texture last time you used it is still true now, always.

Compare this to OpenGL. With OpenGL, you can bind the texture id to a new texture - not only with different dimensions, but maybe of a totally different type. Surprise, OpenGL - that 2-d texture I used is now a cube map! Because anything can change at any time, OpenGL has to track mutations and re-check the validity of bound state when you draw.

Commands Are Assembled in Command Buffers and Then Queued for the GPU

How do your OpenGL commands actually get to the GPU? The OpenGL way involves a fair amount of witchcraft:
  1. You make an OpenGL context current to a thread.
  2. You issue function calls into the OpenGL API.
  3. "Later" stuff happens. If you never call glFlush, glFinish, or some kind of swap command, maybe some of your commands never execute.
That's definitely not how the hardware works. Again, Metal gives us a view of the underlying implementation.

On every modern GPU where I've been able to find out how command processing works, the GPU follows pretty much the same design:
  1. The driver fills in a command buffer - that is, a block of memory with GPU commands (typically a few bytes each) that tell the GPU what to do. The GPU commands don't match the source API - there will typically be commands for draw calls, setting up registers on the GPU, and that might be it.
  2. The driver queues completed command buffers for the GPU to run in some kind of order. The GPU might DMA the command buffer into its own space, or it might read it out of system memory.
Metal exposes this directly: MTLCommandBuffer represents a single command buffer, and MTLCommandQueue is where you queue it once you're done encoding it and you want the GPU to operate on it.

It turns out a fair amount of the CPU time the driver spends goes into converting your OpenGL commands into command buffers.  Metal exposes this too via specific MTLCommandEncoder subclasses. We can now see this work directly.

When you issue OpenGL commands, the encoder is built into the context, is "discovered" via your current thread, and commands are sent to a command buffer that is allocated on the fly. (If you really push the API hard, some OpenGL implementations can block in random locations because the context's encoder can't get a command buffer.)

The OpenGL context also has access to a queue internally, and will queue your buffer when (1) it fills up or (2) you call one of glFlush/glFinish/swap. This is why your commands might not start executing until you call flush - if the buffer isn't full, OpenGL will leave it around, waiting for more commands.
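Here is a toy model of that behavior, as a sketch only: `ToyContext` is an invented class, the commands are plain strings, and the fixed capacity stands in for whatever limit a real driver uses.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using CommandBuffer = std::vector<std::string>;

// Hypothetical model of a GL-style context: an implicit encoder, one open
// command buffer, and an internal queue that the GPU would consume from.
class ToyContext {
public:
    explicit ToyContext(std::size_t capacity) : capacity_(capacity) { assert(capacity > 0); }

    // Encode one command; the buffer is queued only when it fills up.
    void issue(const std::string& command) {
        open_.push_back(command);
        if (open_.size() == capacity_) flush();
    }

    // Hand whatever has been encoded so far to the queue and start a new buffer.
    void flush() {
        if (open_.empty()) return;
        queued_.push_back(std::move(open_));
        open_.clear();  // a moved-from vector is valid but unspecified, so reset it
    }

    std::size_t queued_buffers() const { return queued_.size(); }
    std::size_t pending_commands() const { return open_.size(); }

private:
    std::size_t capacity_;
    CommandBuffer open_;
    std::vector<CommandBuffer> queued_;
};
```

Two commands issued into a three-command buffer just sit there until flush() is called. That is the situation the explicit MTLCommandBuffer and MTLCommandQueue split makes visible.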

One last note: the race condition between the CPU writing commands and the GPU reading them is handled by a buffer being in only one place at a time, whether it's the CPU (encoding commands) or GPU (executing them) - this is true for both Metal and GLES. So while you are encoding commands, the GPU has not started on them yet.

Normally this is not a problem - you queue up a ton of work and the GPU always has a long todo list. But in the non-ideal case where GPU latency matters (e.g. you want the answer as fast as possible), in OpenGL ES you might have to issue a flush so the GPU can start - OpenGL will then get you a new command buffer. (This is why the GL spec has all of that language about glFlush ensuring that commands will complete in finite time - until you flush, the command buffer is just sitting there waiting for the driver to add more to it.)

The GPU Does Work When You Start and End Rendering to a Surface

As I am sure you have read 1000 times, the PowerVR GPUs are tiled deferred renderers. What this means is that rasterizing and fragment shading are done on tiny 32x32 pixel tiles of the screen, one at a time. (The tile size might be different - I haven't found a good reference.) For each rendering pass, the GPU iterates on each tile of the surface and renders everything in the rendering pass that intersects that tile.
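A rough sketch of the binning step, assuming the 32x32 tile size mentioned above and using only a primitive's bounding box. Real hardware is more precise than a bounding box test, and `Box` and `tiles_touched` are names made up for this example.

```cpp
#include <cassert>
#include <utility>
#include <vector>

constexpr int kTileSize = 32;  // assumed, as in the text; real hardware may differ

struct Box { int min_x, min_y, max_x, max_y; };  // inclusive pixel bounds

// Return the (tile_x, tile_y) coordinates of every tile that a primitive's
// bounding box touches on a width x height surface.
std::vector<std::pair<int, int>> tiles_touched(const Box& b, int width, int height) {
    assert(width > 0 && height > 0);
    const int tiles_x = (width + kTileSize - 1) / kTileSize;
    const int tiles_y = (height + kTileSize - 1) / kTileSize;
    std::vector<std::pair<int, int>> out;
    for (int ty = 0; ty < tiles_y; ++ty) {
        for (int tx = 0; tx < tiles_x; ++tx) {
            const bool overlaps_x = b.min_x < (tx + 1) * kTileSize && b.max_x >= tx * kTileSize;
            const bool overlaps_y = b.min_y < (ty + 1) * kTileSize && b.max_y >= ty * kTileSize;
            if (overlaps_x && overlaps_y) out.emplace_back(tx, ty);
        }
    }
    return out;
}
```

On a 64x64 surface, a box from (0,0) to (10,10) lands in one tile, while a box from (20,20) to (40,40) straddles all four.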

The PowerVR GPUs are designed this way so that they can function without high-speed VRAM tied to a high-bandwidth memory bus. Normal desktop GPUs use a ton of memory bandwidth, and that's a source of power consumption.  The PowerVR GPUs have a tiny amount of on-chip video memory; for each tile the surface is loaded into this cache, fully shaded (with multiple primitives) and then saved back out to shared memory (e.g. the surface itself).**

This means the driver has to understand the bounded set of drawing operations that occur for a single surface, book-ended by a start and end. The driver also has to understand the life-cycle of this rendering pass: do we need to load the surface from memory to modify it, or can we just clear it and draw? What results actually need to be saved?  (You probably need your color buffer when you're done drawing, but maybe not the depth buffer. If depth was just used for hidden surface removal, you can skip saving it to memory.) Optimizing the start and end of a surface rendering pass saves a ton of bandwidth.

Metal lets you specify how a rendering pass will work explicitly: an MTLRenderPassDescriptor describes the surfaces you will render to and exactly how you want them to be loaded and stored. You can explicitly specify that the surface be loaded from memory, cleared, or whatever is fastest; you can also explicitly store the surface, use it for an FSAA resolve, or discard it.

To get a command encoder to render (a MTLRenderCommandEncoder), you have to pass a MTLRenderPassDescriptor describing how a pass is book-ended and what surfaces are involved. You can't not answer the question.
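To see why the load and store choices matter, here is a back-of-the-envelope sketch that counts bytes moved between tile memory and system memory for one attachment. The enums mirror the idea of Metal's load and store actions but are declared locally. The model is deliberately simplified and ignores compression, multisample resolve and partial-surface rendering.

```cpp
#include <cassert>
#include <cstddef>

enum class LoadAction { Load, Clear, DontCare };
enum class StoreAction { Store, DontCare };

struct Attachment {
    std::size_t bytes_per_pixel;
    LoadAction load;
    StoreAction store;
};

// Bytes moved between on-chip tile memory and system memory for one attachment
// over one pass. Only Load reads memory and only Store writes it; Clear and
// DontCare start the tile on chip without touching memory.
std::size_t pass_traffic_bytes(const Attachment& a, std::size_t width, std::size_t height) {
    assert(a.bytes_per_pixel > 0);
    const std::size_t surface = width * height * a.bytes_per_pixel;
    std::size_t total = 0;
    if (a.load == LoadAction::Load) total += surface;
    if (a.store == StoreAction::Store) total += surface;
    return total;
}
```

Under this model a 1024x768 pass with a 4-byte color buffer (clear, store) and a 4-byte depth buffer (clear, don't store) moves about 3 MB. Loading and storing both attachments would move about 12 MB.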

Compare this to OpenGL ES; when you bind a new surface for drawing, the driver must note that it doesn't know how you want the pass started. It then has to track any drawing operation (which will implicitly load the surface from memory) as well as a clear operation (which will start by clearing). Lots of book-keeping.

The Entire Pipeline Is Grafted Onto Your Shader

OpenGL encourages us to think of the format of our vertex data as being part of the vertex data, because we use glVertexAttribPointer to tell OpenGL how our vertices are read from a VBO.

This view of vertex fetching is misleading; glVertexAttribPointer really wraps up two very different bits of information:

  • Where to get the raw vertex data (we need to know the VBO binding and base pointer) and
  • How to fetch and interpret that data (for which we need to know the data type, stride, and whether normalization is desired).
The trend in recent years is for GPUs to do vertex fetch "in software" as part of the vertex shader, rather than have fixed function hardware or registers that do the fetch. Moving vertex fetch to software is a win because the hardware already has to support fast streamed cached reads for compute applications, so some fixed function transistors can be thrown overboard to make room for more shader cores.
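A CPU-side sketch of what such a software fetch has to do may help. `AttribFormat` captures the "how to fetch" half of the description, the base pointer is the "where" half, and all names here are invented for the example rather than taken from any driver.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

enum class AttribType { Float32, UInt8 };

// The "how to fetch" half of a vertex attribute description.
struct AttribFormat {
    AttribType type;
    bool normalized;      // map integers to [0, 1] when true
    std::size_t stride;   // bytes from one vertex to the next
    std::size_t offset;   // bytes from the start of a vertex to this component
};

// Fetch one component of vertex `index` from raw buffer memory, roughly what a
// driver-generated shader preamble would have to do for each attribute.
float fetch_component(const unsigned char* base, const AttribFormat& f, std::size_t index) {
    assert(base != nullptr);
    const unsigned char* p = base + index * f.stride + f.offset;
    if (f.type == AttribType::Float32) {
        float value;
        std::memcpy(&value, p, sizeof value);  // avoids unaligned and aliasing reads
        return value;
    }
    const float raw = static_cast<float>(*p);
    return f.normalized ? raw / 255.0f : raw;
}
```

The type and normalization choices change the code path, while the base pointer is just data. That is why a format change can be far more expensive than a pointer change once the fetch lives in the shader.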

On the desktop, blending is still fixed function, but on the PowerVR, blending and write-out to the framebuffer is done in the shader as well.  (For a really good explanation of why blending hasn't gone programmable on the desktop, read this.  Since the currently rendered tile is cached on chip on the PowerVR, you can see why the arguments about latency and bandwidth from desktop don't apply here, making blend-in-shader a reasonable idea.)

The sum of these two facts is: your shader actually contains a bunch of extra code, generated by the driver, on both the front and back.

Metal exposes this directly with a single object: MTLRenderPipelineState. This object wraps up the actual complete GPU pipeline with all of the "extra" stuff included that you wouldn't know about in OpenGL. Like most GPU objects, the pipeline state is immutable and is created with a separate MTLRenderPipelineDescriptor object. We can see from the descriptor that the pipeline locks down not only the vertex and fragment functions, but also the vertex format and anti-aliasing properties for rasterization. Color mask and blending are in the color attachment descriptor, so they're part of the pipeline too.

Every time you change the vertex format (or even pretend to by changing the vertex base pointer with glVertexAttribPointer), every time you change the color write mask, or change blending, you're requiring a new underlying pipeline to be built for your GLSL shader. Metal exposes the actual pipeline, allowing for greater efficiency. (In X-Plane, for example, we always tie blending state to the shader, so a pipeline is a pretty good fit.)

If there's a summary here, it's that GLES doesn't quite match the PowerVR chip, and we can see the mismatch by looking at Metal. In almost all cases, the driver has to do more work to make GLES fit the hardware, inferring and guessing the semantics of our application.

I'll do one more post in this series, looking at Mantle, and some of the terrifying things we've never had to worry about when running OpenGL on AMD's GCN architecture.

* Technically the real API objects are all ObjC protocols, while the lighter-weight struct-like entities are objects. I'll call them all objects here - to client code they might as well be. The fact that API-created objects are protocols stops you from trying to alloc/init them.

** Besides saving bus bandwidth, this technique also saves shading ops. Because the renderer has access to the entire rendering pass before it fills in a tile, it can re-order opaque triangles for perfect front-to-back rendering, leveraging early Z rejection.

Tuesday, April 14, 2015

The OpenGL Impedance Mismatch

As graphics hardware has changed from a fixed function graphics pipeline to a general purpose parallel computing architecture, mid-level graphic APIs like OpenGL don't fit the execution model of the actual hardware as well as they used to.

In my previous post, I said that the execution of GL state change is deferred so that the driver can figure out what you're really trying to do and efficiently change all state at once.

This has been true for a while. For example, older fixed function and partly programmable GPUs might have one set of register state to control the entire fixed-function raster operations.  Here's the R300 (e.g. the Radeon 9700).

  • The blend function and sources share a single register, but
  • The alpha and RGB blend function/sources are in different registers (meaning a single glBlendFuncSeparate partly updates both).
  • Alpha-blend enable shares a register with the flag to separate the blender functions. (Why the hardware doesn't just always run separate and let the driver update both sides of the blender is a mystery to me.)
  • Some GL state actually matches the register (e.g. the clear color is its own register).
So the match-up between imaginary ideal GL pipeline and the hardware isn't perfect. But in the end, the fit is actually pretty good:

  • Fixed function tricks like blending and stenciling are enabled by setting registers on the GPU.
  • Uniforms for a given shader live on the chip while the shader is executing.
  • The vertex fetcher is fixed functionality that is set up by register.
There's a lot written about AMD's Graphics Core Next (GCN) architecture, the GPU inside the Radeon 7900 and friends.  Since GCN GPUs are in both the Xbox One and PlayStation 4 and AMD is reasonably loose with chip documentation and disassembling compilers, we know a lot about how the hardware really works.  And the fit is not so snug.

  • Shader constants come from memory (this has been true for a while now) - this is a good fit for UBOs but a bad fit for "loose uniforms" that are tied to the shader object.  On the GPU, the shader object and uniforms are fully separable.
  • Vertex fetch is entirely in the shader - the driver writes a pre-amble for you.  Thus changing the vertex alignment format (but not the base address) is a shader edit!  Ouch.
  • For shaders that write to multiple render targets, OpenGL lets us remap them via glDrawBuffers, but this export mapping is part of the fragment shader, so that's a shader edit too.
Those shader edits are particularly scary - this is a case where we (the app) think we're doing something orthogonal to the shading pipeline (e.g. just setting up a new VBO) but in practice, we're getting a full shader change.

In fact, the impedance mismatch makes this even worse: if we're going to have any hope of changing state quickly, the driver has to track past combinations of vertex layout, MRT indirection, and the actual GLSL linked program, and cache the "real" shader that backs this combined state.  Each time we change the front-end vertex fetch format or back-end MRT layout, the driver has to go see if that combination exists in cache.
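To make that caching concrete, here is a minimal sketch of what such a lookup might look like. Everything in it is hypothetical (the key fields, `get_variant`, the fixed-size table); real drivers are not public and surely do something more sophisticated. The point is only that the "real" shader is keyed on more than the GLSL program.

```c
#include <stdint.h>

/* Hypothetical key: everything that ends up baked into the "real" shader. */
typedef struct {
    uint32_t program;        /* linked GLSL program id */
    uint32_t vertex_layout;  /* hash of the vertex fetch format */
    uint32_t mrt_mapping;    /* hash of the glDrawBuffers remap */
} variant_key;

typedef struct {
    variant_key key;
    int         variant_id;  /* stand-in for the patched GPU shader */
} variant_entry;

#define MAX_VARIANTS 64

typedef struct {
    variant_entry entries[MAX_VARIANTS];
    int count;
    int compiles;            /* number of expensive patches performed */
} variant_cache;

static int key_equal(const variant_key *a, const variant_key *b)
{
    return a->program == b->program &&
           a->vertex_layout == b->vertex_layout &&
           a->mrt_mapping == b->mrt_mapping;
}

/* Return the cached variant for this state combination, patching a new
   one only on a miss. Returns -1 if the toy cache is full. */
static int get_variant(variant_cache *c, variant_key k)
{
    for (int i = 0; i < c->count; ++i)
        if (key_equal(&c->entries[i].key, &k))
            return c->entries[i].variant_id;
    if (c->count == MAX_VARIANTS)
        return -1;
    c->compiles++;                          /* the slow path */
    c->entries[c->count].key = k;
    c->entries[c->count].variant_id = c->count;
    return c->entries[c->count++].variant_id;
}
```

Even in this toy, the hit path is a search that runs whenever the layout or MRT mapping might have changed, which is exactly the per-draw overhead the app never asked for.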

The back-end MRT layout isn't the worst problem because we are hopefully not going to change rendering targets that frequently.  But the vertex format is a real mess; every call to glVertexAttribPointer potentially invalidates the vertex layout; the driver can either try to heavily check state change, or regenerate the shader front-end; both options stink.

You can see OpenGL trying to track the moving target of the hardware in the extensions: GL_ARB_vertex_array_object was made part of core OpenGL 3.0 and ties up the entire vertex fetch plus base pointer in a single "object" for quick recall.  But we can see that this is now a pretty poor fit; half of the state that the VAO covers (the layout) is really part of the shader, while the other half (the actual address of the VBO plus offset) is separate.*

A newer extension, GL_ARB_vertex_attrib_binding, separates the vertex format (which is part of the shader in hardware) from the actual data location; it was made part of OpenGL 4.3. I don't know how good of a fit this is; the vertex attribute binding leaves the data stride out of the "expensive" format binding.  (My guess is that the intended implementation is to specify the data stride as a constant in a constant buffer somewhere.) In theory with this extension, only glVertexAttribFormat requires an expensive shader patch, and applications can change VBO sources without calling it.
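A toy model of that split, as I understand it: format changes mark the shader front-end dirty, buffer changes don't. The `toy_` and `fmt_` names are made up for illustration and only loosely mirror glVertexAttribFormat and glBindVertexBuffer; whether a real driver treats stride as cheap is, as noted, a guess.

```c
#include <stdint.h>

/* Toy model only: none of these are real GL or driver entry points. */
typedef struct {
    int      size;             /* components per attribute */
    uint32_t relative_offset;  /* offset inside one vertex */
} attrib_format;

typedef struct {
    uint32_t buffer;           /* VBO name */
    uint32_t offset;           /* base offset into the VBO */
    uint32_t stride;           /* bytes between vertices */
} buffer_binding;

typedef struct {
    attrib_format  format[4];
    buffer_binding binding;
    int format_dirty;          /* would force a shader patch at draw time */
    int shader_patches;        /* count of expensive patches */
} toy_context;

/* Loosely analogous to glVertexAttribFormat: changes the fetch code. */
static void toy_attrib_format(toy_context *c, int index, int size, uint32_t rel)
{
    if (index < 0 || index >= 4)
        return;
    c->format[index].size = size;
    c->format[index].relative_offset = rel;
    c->format_dirty = 1;
}

/* Loosely analogous to glBindVertexBuffer: only changes the data source. */
static void toy_bind_vertex_buffer(toy_context *c, uint32_t buf,
                                   uint32_t off, uint32_t stride)
{
    c->binding.buffer = buf;
    c->binding.offset = off;
    c->binding.stride = stride;
}

/* The draw call pays for a patch only if the format changed. */
static void fmt_draw(toy_context *c)
{
    if (c->format_dirty) {
        c->shader_patches++;
        c->format_dirty = 0;
    }
}
```

In this model, an app that sets its format once and then only swaps buffers pays for one patch total, no matter how many draws follow.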

If there's an executive summary here, it's that OpenGL as an API has never been a perfect representation of what the hardware is doing, but as the hardware moves toward general purpose compute devices that work on buffers of memory, the pipeline-and-state model fits less and less.

In my next posts I'll take a look at Metal and Mantle - these new APIs let us take the red pill and see how deep the rabbit hole goes.

* I am of the opinion that VAOs were a mistake from day one.  VAOs are mutable to allow them to be 'layered' on top of existing code the way VBOs were, and even if they weren't, the data location of the VBO is mutable at the driver level (because the VBO may at the time of draw be in VRAM or system memory, and may require a change to the memory map of the CPU that the GPU holds to draw, or it may require a DMA copy to move it to RAM).  The result is that binding a VAO doesn't let you skip the tons of validation and synchronization needed to actually start drawing once the base pointers have been moved.

OpenGL State Change Is Deferred

This is totally obvious to developers who have been coding high performance OpenGL for years, but it might not be obvious to newer developers starting with OpenGL or OpenGL ES, so...

In pretty much any production OpenGL driver, the real 'work' of OpenGL state change is deferred - that work is executed on the next draw call (e.g. glDrawElements or glDrawArrays).

This is why, when you profile your code, glBindBuffer and glVertexPointer appear to be "really fast" and yet glDrawArrays is using a ton of CPU time.

The work of setting up the hardware for GL state is deferred because often the state cannot be set up until multiple calls come in.

Let's take as an example, vertex format.  You do this:
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 32, (char *) 0);
glVertexAttribPointer(1, 3, GL_FLOAT, GL_FALSE, 32, (char *) 12);
glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, 32, (char *) 24);
The way this is implemented on modern GPUs is to generate a subroutine or pre-amble code for your vertex shader that "executes" the vertex fetch based on these stride rules.
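For reference, the three calls above describe an interleaved 32-byte vertex. A sketch of a matching C struct follows; the attribute meanings (position, normal, texture coordinate) are my assumption for illustration, since the calls only give sizes and offsets, and it assumes 4-byte floats.

```c
#include <stddef.h>

/* Hypothetical interleaved vertex matching the three calls above:
   3 floats at offset 0, 3 floats at offset 12, 2 floats at offset 24. */
typedef struct {
    float position[3];  /* attribute 0 */
    float normal[3];    /* attribute 1 */
    float texcoord[2];  /* attribute 2 */
} vertex_t;

/* The stride passed to glVertexAttribPointer is the size of one vertex. */
enum { VERTEX_STRIDE = sizeof(vertex_t) };

/* Byte offset of attribute i (0..2) inside one vertex, or -1 if out of range. */
static long attrib_offset(int i)
{
    switch (i) {
    case 0:  return (long)offsetof(vertex_t, position);
    case 1:  return (long)offsetof(vertex_t, normal);
    case 2:  return (long)offsetof(vertex_t, texcoord);
    default: return -1;
    }
}
```

The fetch pre-amble the driver generates has to bake in exactly these numbers, which is why it can't be written until all three attributes are known.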

There's no point in generating the shader until all of the vertex format is known; if the driver went and patched your shader after the first call (setting up attribute 0) using the old state of attributes 1 and 2, all of that work would be wasted and redone when the next two glVertexAttribPointer calls come in.

Furthermore, the driver doesn't know when you're done.  There is no glDoneScrewingAroundWithVertexAttribPointer call.

So the driver does the next best thing - it waits for a draw call.  At that point it goes "hey, I know you are done changing state because this draw call uses what you have set now."  At that point it goes and makes any state change that is needed since the last draw call.

What this means is that you can't tell how "expensive" your state change is by profiling the code doing the state change.  The cost of state change when you call it is the cost of recording for later what needs to be done, e.g.
void glBlendFunc(GLenum sfactor, GLenum dfactor)
{
   /* record the request on the current context; no hardware work yet */
   context * c = internal_get_thread_gl_context();
   c->blend.sfactor = sfactor;
   c->blend.dfactor = dfactor;
   c->dirty_bits |= bit_blend_mode;
}
In other words, the driver is just going to record what you said to the current context and make a note that we're "out of sync" state-wise.  The draw call does the heavy lifting:
void glDrawArrays(GLenum mode, GLint first, GLsizei count)
{
   context * c = internal_get_thread_gl_context();
   if(c->dirty_bits & bit_blend_mode)
   {
      /* sync blend state to the hardware - this is possibly slow */
   }
   /* more check and sync */
   c->dirty_bits = 0;
   /* do actual drawing work - this isn't too slow */
}
On Apple's OpenGL implementation, the stack is broken into multiple parts in multiple dylibs, which means an Instruments trace often shows you subroutines with semi-readable names; you can see draw calls updating and synchronizing state.  On Windows the GL stack is monolithic, stripped, and often has no back-trace info, which makes it hard to tell where the CPU is spending time.

One final note: the GL driver isn't trying to save you from your own stupidity.  If you do this:
for(int i = 0; i < 1000; ++i)
{
   glEnable(GL_BLEND);
   glDrawArrays(GL_TRIANGLES, i*12, 12);
}
Then every call to glEnable is likely to make the blend state 'dirty' and every call to glDrawArrays is going to spend time re-syncing blend state on the hardware.

Avoid calling state changes that aren't needed even if they appear cheap in their individual function call time - they may be "dirtying" your context and driving up the cost of your draw calls.
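One common way to follow this advice is to shadow the state on the app side and skip calls that wouldn't change anything. Here is a minimal sketch under stated assumptions: `toy_driver` is a made-up stand-in that marks blend state dirty on every call, like the glBlendFunc sketch above, and `state_shadow` is a hypothetical app-side filter. None of this is a real GL API.

```c
#include <stdbool.h>

/* Toy stand-in for the driver: counts how often blend state is re-synced. */
typedef struct {
    bool blend_enabled;
    bool blend_dirty;
    int  resyncs;
} toy_driver;

/* Record the value and mark it dirty, even if the value did not change. */
static void toy_enable_blend(toy_driver *d, bool on)
{
    d->blend_enabled = on;
    d->blend_dirty = true;
}

/* The draw call pays for any dirty state. */
static void toy_draw(toy_driver *d)
{
    if (d->blend_dirty) {
        d->resyncs++;
        d->blend_dirty = false;
    }
}

/* App-side shadow: remembers the last value sent to the driver. */
typedef struct {
    toy_driver *driver;
    bool known;          /* false until the first call goes through */
    bool blend_enabled;
} state_shadow;

static void shadow_enable_blend(state_shadow *s, bool on)
{
    if (s->known && s->blend_enabled == on)
        return;          /* redundant: never reaches the driver */
    s->known = true;
    s->blend_enabled = on;
    toy_enable_blend(s->driver, on);
}

/* Enable blending before every draw, unfiltered; returns resync count. */
static int run_unfiltered(int draws)
{
    toy_driver d = { false, false, 0 };
    for (int i = 0; i < draws; ++i) {
        toy_enable_blend(&d, true);
        toy_draw(&d);
    }
    return d.resyncs;
}

/* Same loop, but routed through the shadow. */
static int run_filtered(int draws)
{
    toy_driver d = { false, false, 0 };
    state_shadow s = { &d, false, false };
    for (int i = 0; i < draws; ++i) {
        shadow_enable_blend(&s, true);
        toy_draw(&d);
    }
    return d.resyncs;
}
```

In this toy, the unfiltered loop re-syncs on every draw while the filtered one re-syncs once. The shadow only works if every state change in the app goes through it; anything that touches GL state behind its back leaves it stale.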