Wednesday, January 30, 2013

Instancing for BrickSmith

BrickSmith is a FOSS 3-d LDraw-compatible editor for Mac; basically it lets you model legos on your computer. BrickSmith is a wonderful program - really a joy to use.  I have submitted a few patches, mostly little features that I want for my own modeling.  Recently I rewrote the low level OpenGL drawing code to improve performance and quality; hopefully we'll ship this in the next major patch.

This post describes the OpenGL techniques I picked.  Since OpenGL is a cross-platform standard, it's possible that the design might be of interest (at least for reference) to other LDraw program developers.  In some cases I will reference X-Plane performance numbers because I have access to that data.

LDraw Basics

If you aren't familiar with LDraw's simulation of lego bricks, here are the operative points to a graphics programmer:
  • The file format effectively turns into something like a push-down state stack and individual line, tri, and quad primitives.
  • The vast majority of drawing tends to be colored lines and polygons; while texture support was added to the format, it's not yet in wide-spread production.
  • LDraw models tend to be vertex bound; the format contains no LOD for reducing vertex count, and the lego parts are modeled in full geometric detail.  (Consider: even though the lego 'studs' are only broken into octagon-prisms, you still have 1024 of them on a single baseplate.)

Basic OpenGL Decisions

The new renderer uses shaders, not the fixed function pipeline.  Because of this, I was able to program one or two LDraw-specific tricks into the shaders to avoid CPU work.

The shaders understand the LDraw concept of a "current" color (which is the top of a stack of color changes induced by the inclusion of sub-parts/sub-models) vs static hard-coded colors; a given part might be a mix of "this is red" and "fill in the blank".  I represent these meta-colors that come off the stack as special RGBA quadruples with A=0 and RGB having a special value; the shader can then pull off the current stack state and substitute it in.  This is important because it means that I can use a single mesh (with color) for any given part regardless of color changes (the mesh doesn't have to be edited) and I don't have to draw the mesh in two batches (which would cost CPU time).

BrickSmith "flattens" parts in the standard LDraw library into a single simple representation - in other words, the 'stack' is simulated and the final output is consolidated.  Thus a part is typically a single set of tris, lines, and draws, all stored in a single VBO, with no state change.  Thus the library parts are "atomic".  The VBO is loaded as STATIC_DRAW (because it is virtually never changed) for maximum speed.

Because LDraw models are currently flat shaded, BrickSmith does not attempt to index and share vertices; all VBOs are non-indexed, and a distinct set of GL_QUADS is maintained to avoid vertex duplication.

(I believe we would get a performance boost by indexing vertices, but only if smooth shading could be employed; with flat shading virtually all vertices have different normals and cannot be merged.)

Attribute Instancing

A naive way to draw part meshes would be to use glPushMatrix/glRotate/glTranslate sequences to push transforms to the GPU.  The problem with this technique is that it is CPU expensive; the built-in matrix transforms are uniform state, and on modern cards this uniform state has to live in a buffer where the GPU can read it

Thus each time you 'touch' the matrix state, the entire set of built-in uniforms including the ones you haven't messed with (your projection matrix, lighting values, etc.) get copied into a new chunk of buffer that must be sent to the card.  The driver doesn't know that you're only going to touch transform, so it can't be efficient.

That uniform buffer will then either be in AGP memory (requiring the card to read over the PCIe bus at least at first to draw) or it will have to be DMAed into VRAM (requiring the card to set up, schedule, and wait for a DMA transfer).  Either way, that's a lot of work to do per draw call, and it's going to limit the total number of draw calls we can have.

Remember, 5000 draw calls is "a lot" of draw calls for real-time framerates.  But 5000 bricks is only one or two big lego models.  If you model all of the modular houses and you just want to see them in realtime, that's 17,090 parts -- that's a lot of draw calls!

One trick we can do to lower the cost of matrix transforms is to store our model view matrix in vertex attributes rather than in a uniform.  Vertex attributes are very cheap to change (via glVertexAttrib4f) and it allows us to draw one brick many times with no uniform change (and thus all of that uniform work by the card gets skipped).  If we draw one brick many times, we can avoid a VBO bind, avoid uniform change, and just alternate glVertexAttribute4fv, glDrawArrays repeat.

This technique is sometimes called "immediate mode" instancing because we're using immediate mode to jam our instancing data down the pipe quickly.  For X-Plane on OS X, immediate mode instancing is about 2x faster than the built-in uniform/matrix transforms.

BrickSmith's new renderer code is built around a 24-float instance: a 4x4 matrix stored in 4 attributes, an RGBA current color, and an RGBA compliment color.

Hardware Instancing

One nice thing about using attributes to instance is that it makes using hardware instancing simple.  With hardware instancing, we give the hardware an array of 24-float "instances" (e.g. the position/colors of a list of the same brick in many places) and the brick itself, and issue a single draw call; the hardware draws the brick mesh for each instance location.

Hardware instancing is much faster than immediate mode instancing - YMMV but in X-Plane we see instancing running about 10x faster than immediate mode; X-Plane can draw over 100,000 individual "objects" when instancing is used - more than enough for our modular houses.

To use instancing we put the instance data into a VBO and use glVertexAttribDivisor to tell OpenGL that we want some attributes (the instance data) to be used once per model, while the other data is once per vertex.

For BrickSmith, the instance locations are GL_STREAM_DRAW - BrickSmith generates the list of bricks per frame as the renderer traverses the model.  So the bricks themselves are static but their locations are on the fly.  I chose this design because it was the simplest; BrickSmith has no pre-existing way to cache sets of brick layouts.  At 24 floats, even a 100,000 brick model is only about 10 MB of data per frame - well within the range of what we can push to the card.

(By comparison, X-Plane precomputes and saves sets of objects, so the instance location data is GL_STATIC_DRAW.)

Drawing Dispatch

The actual drawing code uses a mix of immediate mode instancing, hardware instancing, and straight drawing.  The logic goes something like this:
  • As draw calls come in, we simply accumulate information about what the app wants to draw.  Parts that are off screen are culled.  (Since we are bound on the GPU, we can afford the CPU time to reduce vertex count.)
  • If a part contains translucency, it goes in a special "sorted" bucket, which is drawn last, from back to front.  The sorting costs CPU time, so we only do this when we identify a part with translucency.
  • Parts with "stack complexity" (e.g. texturing) that need to be drawn in multiple draw calls go in another bucket, the "just draw it" bucket - they are sorted by part so that we can avoid changing VBOs a lot - changing VBOs takes driver time!
  • Parts that are simple go in the "instancing" bucket, and we accumulate a list of locations and parts (again, organized by part.)
When it comes time to draw the instancing bucket, we choose immediate mode or hardware instancing based on (1) whether the GPU has hardware instancing and (2) the number of instances; for very small number of instances, it's cheaper to push the attributes than to change the buffer bindings (which is necessary for hardware instancing). The exact cutoff will vary with app, but typically hardware pays for more than 3 instances.

Note that all instanced parts are written into a single giant stream buffer.  This lets us avoid mapping and unmapping the buffer over and over, and it lets us avoid having a huge number of small buffers.  
Generally fewer, larger VBOs are better - they're relatively expensive objects for the driver to manage; if your VBOs are less than one VM page, find a way to merge them.

Performance Results

The new renderer often runs about 2x faster than the existing code, while providing sorted transparency, and it typically runs at significantly lower CPU.

One case where it did not run faster was with Datsville - the Datsville model I have for testing is about 39,000 bricks, resulting in 125 million vertices.  It runs on my 4870 at about 5 fps.

In the old renderer, I would see 100% CPU and about 5 fps; with the new one, maybe 30-35% CPU and 5.1 fps.  Why no more speed?  It turns out that the total vertex capacity of the card is only about 500 million vertices/second, so the card is vertex bound. (This is quite rare for games.)  When the model is partly off-screen, framerate increases significantly.