Saturday, April 20, 2013

There Must Be 50 Ways to Draw Your Streaming Quads

So you want to draw a series of streaming 2-d quads.  (By streaming I mean: their position or orientation is changing per frame.)  Maybe it's your UI, maybe it is text, maybe it is some kind of 2-d game or particle system.  What is the fastest way to draw them?

For a desktop, one legitimate response might be "who cares"?  Desktops are so fast now that it would be worth profiling naive code first to make sure optimization is warranted.  For the iPhone, you probably still care.

Batch It Up

The first thing you'll need to do is to put all of your geometry in a single VBO, and reference only a single texture (e.g. a texture atlas). This is mandatory for any kind of good performance.  The cost of changing the source buffer or texture (in terms of CPU time spent in the driver) is quite a bit larger than the time it takes to actually draw a single quad.  If you can't draw in bulk, stop now - there's no point in optimizing anything else.

(As a side note: you'll eat the cost of changing VBOs even if you simply change the base address pointer within the same VBO.  Every time you muck around with the vertex pointers or vertex formats, the driver has to re-validate that what you are doing isn't going to blow up the GPU.  It's incredible how many cases the driver has to check to make sure that a subsequent draw call is reasonable.  glVertexPointer is way more expensive than it looks!)

The Naive Approach

The naive approach is to use the OpenGL transform stack (e.g. glRotate, glTranslate, etc.) to position each quad, then draw it.  Under the hood, the OpenGL transform stack translates into "uniform" state change - that is, changing state that is considered invariant during a draw call but changes between draw calls.  (If you are using the GL 2.0 core profile, you would code glRotate/glTranslate yourself by maintaining a current matrix and changing a uniform.)

When drawing a lot of stuff, uniforms are your friend; because they are known to be uniform for a single draw call, the driver can put them in memory on the GPU where they are quick to access over a huge number of triangles or quads.  But when drawing a very small amount of geometry, the cost of changing the uniforms (more driver calls, more CPU time) begins to outweigh the benefit of having the GPU "do the math".

In particular, if each quad has its own matrix stack in 2-d, you are saving 24 MADs per quad by requiring the driver to rebuild the current uniform state.  (How much does that cost?  A lot more than 24 MADs.)  Even ignoring the uniforms, the fact that a uniform changed means each draw call can only draw 1 quad.  Not fast.

Stream the Geomery

One simple option is to throw out hardware transform on the GPU and simply transform the vertices on the CPU before "pushing" them to the GPU.  Since the geometry of the quads are changing per frame, you were going to have to send them to the GPU anyway.  This technique has a few advantages and disadvantages.
  • Win: You get all of your drawing in a single OpenGL draw call with a single VBO.  So your driver time is going to be low and you're talking to the hardware efficiency.
  • Win: This requires no newer GL 3.x/4.x kung fu.  That's good if you're using OpenGL 2.0 ES on an iPhone, for example.
  • Fail: You have to push every vertex every frame. That costs CPU and (on desktops) bus bandwidth.
  • Not-total-fail: Once you commit to pushing everything every frame, the cost of varying UV maps in real-time has no penalty; and there isn't a bus to jam up on a mobile device.
Note that if we were using naive transforms, we'd still have to "push" a 16-float uniform matrix to the card (plus a ton of overhead that goes with it), so 16 floats of 2-d coordinates plus texture is a wash.  As a general rule I would say that if you are using uniforms to transform single primitives, try using the CPU instead.

Stupid OpenGL Tricks

If you are on a desktop with a modern driver, you can in theory leverage the compute power of the GPU, cut down your bandwidth, and still avoid uniform-CPU-misery.

Disclaimer: while we use instancing heavily in X-Plane, I have not tried this technique for 2-d quads.  Per the first section, in X-Plane desktop we don't have any cases where we care enough.  The streaming case was important for iPhone.

To cut down the amount of streamed data:
  • Set the GPU up for vertex-array-divisor-style instancing.
  • In your instance array, push the transform data.  You might have an opportunity for compression here; for example, if all of your transforms are translate+2-d rotate (no scaling ever), you can pass a pair of 2-d offsets and the sin/cos of the rotation and let the shader apply the math ad-hoc, rather than using a full 4x4 matrix.  If your UV coordinates change per quad, you'll need to pass some mix of UV translations/scales.  (Again, if there is a regularity to your data you can save instance space.)
  • The mesh itself is a simple 4-vertex quad in a static VBO.
You issue a single instanced draw call and the card "muxes" together the instancing transforms with the vertex data.  You get a single batch, a small amount of data transferred over the bus, and low CPU use.

There are a few other ways to cut this up that might not be as good as hw instancing:
  • There are other ways to instance using the primitive ID and UBOs or TBOs - YMMV.
  • If you have no instancing, you can use immediate mode to push the transforms and make individual draw calls.  This case will probably outperform uniforms, but probably not outperform streaming and CPU transform.
  • You could use geometry shaders, but um, don't.