Saturday, August 31, 2013

Theoretical Engineering: Occlusion Culling for BrickSmith

I have been pushing for a formal level-of-detail scheme for LDraw; the performance problem is that LDraw bricks contain a fairly high vertex count; at low zoom a large model might be entirely on screen, resulting in frame-rate limited by pure vertex count.  (The vertices are on the GPU, the batches are drawn with array-attribute-divisor instancing, and the destination viewport is small with simple shaders, so the limiting factor is trying to cram 125M vertices through the GPU.)

During the discussion, some developers brought up the question of occlusion culling.  In many of these wide views, a fair amount of the geometry is going to have no contribution in the final image due to occlusion by other, closer bricks.

Occlusion culling is tricky stuff. Occlusion queries on the GPU are not available to the CPU until drawing completes; recent GPUs are finally gaining the ability to use the occlusion query result directly, eliminating back-to-the-CPU latency.  CPU-side occlusion queries are usually facilitated by heavy pre-processing of the source data, but that isn't an easy option in BrickSmith because it is an editor, not a viewer (and therefore the source data is continually changing).

What follows is speculative engineering, e.g. how could occlusion culling work on a modern GPU. Looking at the requirements for the design we can then infer as to whether such a scheme is feasible. (Spoiler alert: it is not.)

When looking at the algorithm that follows, please keep one thing in mind: the limiting factor here is vertex count - that's what we are trying to optimize.  This is a very unusual situation in 3-d graphics; most systems are designed to not have this problem. In particular, we are not fill rate bound, so we can expect GPU commands involving small numbers of vertices to complete reasonably quickly, because the GPU isn't drowning in last-frames pixel-fill operations.

Here's the basic idea:
  • Every part that we draw is divided into two sets of geometry: the crude occluders and the rest of the part.  The crude occluders are the non-transparent triangles and quads (no lines) that are big enough to be worth drawing early to start filling space.  The goal is to keep fill rate high while lowering vertex count, so that we can fill a lot of the screen quickly and then see what is left. (For example, for a 2x4 brick, the outer 5 quads of the brick top and sides would be the crude occluders; the entire interior, edging, studs and tubes and line work would be the rest.)
  • Each brick is pre-'normalized' so that it exists within a unit cube; when we draw our brick instances, the transform matrix will include scaling factors to reconstitute the brick, not just the translation and rotation we have now.  (This means that we can know from the instance data where a brick will be.)
  • We set up an off-screen render target so that we can get direct access to the depth buffer; we'll blit the whole thing down to screen later, and it'll be cheap (one single-pass screen-sized blit with no processing).
  • We start drawing the way the engine works now: we traverse the data model and dynamically build instancing lists for all opaque parts.  (Translucent parts get dumped in a holding bay for later, slower, sorted processing; they won't participate in occlusion.)  This step exists in the current BrickSmith source code and is fast enough that CPU is not the bottleneck even on my older 2008 Mac Pro.
  • We draw the instance list with the crude occluder meshes. This should, in theory, complete quickly, because the vertex count is low, early Z can take effect, there is no blending, and our shaders are not complicated.
So far what we have basically done is rapidly put down a pre-Z style pass with some of our geometry. If our only goal was to limit fill rate, this would be a pretty good setup.  But we need to eliminate actual drawing calls.

Our next step is to grab the depth buffer and use render-to-texture to ping-pong it, forming a hierarchy of depth buffers.  For each depth buffer we will use a "farthest" policy, so that a low-res depth texel contains the farthest depth of the four high-res depth texels that it covers.

Now we can run GPU culling.  For each instance list VBO, we run it through a geometry shader, treating one instance as one vertex.  (In BrickSmith, instances are a 4x3 affine transform plus two RGBA colors, so our 20 floats can be five attributes in the vertex stream.)  The geometry shader calculates the screen space AABB of the instance, fetches depth from the appropriate level of the Z buffer, and determines whether the brick's bounding box is clearly behind every single pixel already drawn into the AABB's location.  The geometry shader then outputs zero or one vertices depending on the cull.

The geometry shader must be fed into a transform feedback stage so that we can (1) capture the vertices back rather than render them and (2) so the GPU will retain the number of vertices for later use in a glDrawTransformFeedbackStreamInstanced call.

(We could implement frustum culling here too, but I think we need to keep the original CPU frustum cull from the initial draw in order to not draw the entire model off-screen when we are setting up our early-Z buffer.)

Note that BrickSmith currently uses one giant VBO for its instance list - the individual bricks know which part of the instance list they own, so we save on VBO count.  However, if we want to reduce the instance count we have a serious problem: our segment buffering will be borked.  One way to solve this is to transform each primitive into its own transform feedback VBO, so that we can know how many primitives to draw and where they start.  This would probably be bad for total draw performance.

(There might be a better way to organize this, perhaps with multi-draw-indirect or compute, or atomic writes to a memory buffer.  I have not dug into this because, as you will see from later in the post, this scheme is not actually shippable.)

Finally, with the instance lists culled, we can go and draw 'the rest' of the model; since the instance lists have been trimmed, we skip most of the geometry for every occluded brick.  Thus we will get a large vertex savings on every occluded brick.  (For example, a 2x4 brick will save better than 97% of its vertices, even after excluding lines, when occlusion culled.)

There are a few reasons why we can't actually ship this.
  • The biggest single problem that I see is that the part library would have to be pre-tagged into rough occluder geometry and the rest.  Given the size of the part library and its structure as a series of "subroutine"-style partial primitives, such a partitioning would both be a huge manual undertaking (that could in theory be automated) and it would result in a transformation of the parts that could not be fed back into the LDraw system.
  • The code above assumes the most modern GPU features.  Instanced transform feedback is an OpenGL 4.2 feature, so it would require a DX11-class GPU.  Apple has not yet shipped OpenGL 4.0, so at best we'd be dropping support for Lion and Mountain Lion.
  • Using separate VBOs for each brick (to get the cull list) would be a serious performance problem.  It might be possible to overcome this with other new OpenGL techniques, but the requirement on a new driver would remain. GPU-culling could also be performed only on parts for which there was a large part count, applying an 80-20 rule to culling.
  • Finally, the technique above is very complicated - it is probably more implementation work than the entire renderer-rewrite was, and BrickSmith is not a commercial project; those of us who poke at it do so in our very, very limited spare time.  I don't thin anyone involved in the project has time for such serious GPU work.
Edit: reading Daniel R├íkos' article more carefully, I see he uses an asynchronous query to get the amount of feedback delivered.  In theory this would let us source our instances out of a single VBO even after culling, which would be a performance win.  However, I am not sure that this would be completely free. The algorithm without this modification can run entirely asynchronously - that is, the entire frame can be "late" and still render.  Pulling back the instance count to the CPU therefore might introduce the first sync point and hurt performance.  (E.g. if the previous frame hasn't completed, the next frame's query for culling count will block because the GPU wans't available.)

The conclusion is, of course, just to use crude reduced-LOD models for far viewing and go home happy.  But having spent the time considering what it would take to do GPU culling, I figured I should at least write it down.

Sunday, August 25, 2013

What Does gpus_ReturnGuiltyForHardwareRestart Mean?

First, a bunch of disclaimers: I do not know how the PowerVR SGX hardware really works, I do not have access to either the Apple OpenGL source or the PowerVR SGX driver source, and therefore what follows is all speculative engineering (read: guess work).  So if what follows is wrong, you've been warned.

With that out of the way, you're sitting in your office sipping iced coffee and coding OpenGL ES 2.0 subroutines for IOS.  You test on the device and get an access violation with this back-trace.
frame #0: 0x31e23916 libGPUSupportMercury.dylib`gpus_ReturnGuiltyForHardwareRestart + 10 frame #1: 0x31e24398 libGPUSupportMercury.dylib`gpusSubmitDataBuffers + 104 frame #2: 0x2c38888c IMGSGX543GLDriver`SubmitPackets + 124 frame #3: 0x2f74d6be GLEngine`gleCleanupOrphans + 130 frame #4: 0x2f72151e GLEngine`glBufferData_Exec + 254 frame #5: 0x010be8e6 libglInterpose.dylib`buffer_data(__GLIContextRec*, unsigned int, long, void const*, unsigned int) + 158 frame #6: 0x2f7b1c26 OpenGLES`glBufferData + 38
The question is: WTF?  What does this mean?  You look at your call to glBufferData and it looks (1) totally sane and (2) seems to have exploded on the 405th invocation.  Why did it just stop working now?

From what I can tell, you get an intentional crash in gpus_ReturnGuiltyForHardwareRestart when something goes wrong on the GPU itself, asynchronously, that the driver failed (or didn't bother) to catch.

For example, if you are calling glDrawElements with VBOs bound for both the element array and all vertex buffer sources, then the draw call will be hardware accelerated*: the driver will write some commands to the command buffer to draw from addresses in the GPU address space (which I guess is just system memory for an iPhone) and the GPU will later read the command and start fetching elements.

If you have screwed up and requested an element index that is out of bounds for the vertex arrays, the driver won't notice, because it is not copying your vertex data (and fortunately isn't wasting time bounds-checking your index buffer).  Instead the GPU will eventually start fetching vertices directly, and when it notices that one of the vertex fetches went out of bounds, it's going to make a note for the driver to fetch later.

So to sum up so far: you do a hardware-accelerated glDrawElements with bad index data (or the wrong VBO bound, or any other way to get out-of-bounds vertex fetches) and some time later the GPU gets around to executing your command (which has not been pre-checked), notes that it went out of bounds, and leaves a note for the driver to pick up.

So now we can look at why we blew up in glBufferData, a seemingly unrelated call.  glBufferData called into GLEngine (which is Apple's common OpenGL engine), which eventually had to talk to the specific PowerVR SGX driver.  (Apple's OpenGL stack is made of a common front-end that Apple produces, and back-end drivers that talk to specific hardware.)  The SGX driver at that point goes to talk to the hardware and discovers that since its last check, something really bad happened (E.g. we went out of bounds in our draw call).  The SGX driver then calls Apple back, calling gpus_ReturnGuiltyForHardwareRestart, which I guess is Apple's way to have a GPU vendor's driver tell them that the GPU itself seg faulted.

What makes these crashes exciting is their synchronous nature: the call that you get the crash in is (1) not the call that offended OpenGL and (2) affected by timing on both the CPU, and GPU (because the crash comes in the first CPU call to talk to the GPU after the GPU detects the problem, which happens whenever it has time to get to your draw call).  So the normal technique (comment things out and see what changes) just moves the crash around.

Based on this speculation, the way I fixed the problem was: I wrote a (very slow) debug routine to check the indices of all glDrawElements and glDrawArrays calls, mapping back the buffers as needed, and asserting that things are sane.  This immediately caught our real problem: a client-array draw call was failing to unbind VBOs - by luck the call before had changed to leave a VBO bound.  The client-array call was now drawing out of VBO memory, not its own, and since the VBO was smaller than the client array being specified, the draw call would run off the end of the VBO.

Because we have macros that map all glDrawElements and glDrawArrays calls to our own routines (that then call the real glDrawElements and glDrawArrays calls) adding this error checking was quite trivial; why we have those macros will have to be another blog post.

* Well, maybe it will be hardware accelerated.  If you have poorly aligned vertex data or use a wonky format, you might fall back to software vertex transfer.  This is fairly easily spotted in Instruments because your draw call will have routines with IMM below them in the trace, and a lot of CPU time will be spent immediately copying your vertices.  Accelerated draw calls themselves take almost no time to execute other than time to update GL state (which is also usually visible in Instruments).