Tuesday, October 29, 2013

Good Advice From GIT

I found this in a GIT man page.  (Emphasis added by me.)
Run 3-way file level merge for text files, but take lines from both versions, instead of leaving conflict markers. This tends to leave the added lines in the resulting file in random order and the user should verify the result. Do not use this if you do not understand the implications.
I thought that was pretty good advice for all programming (but it does apply double to GIT).

Wednesday, October 02, 2013

3-d Mouse Testing

I just finished fixing a long-running bug in BrickSmith's culling code; this post is a review of ways to do mouse testing of a 3-d model.

Modelview Space

The original BrickSmith code tested the mouse in the modelview-space of the entire model:
  • The mouse click is treated as a ray in normalized device coordinates (that is, it goes from Z=-1 to Z=1 at an X and Y that are proportional to the click on the viewport).
  • The mouse click ray is transformed by the inverse of the model-view + projection matrices to get a test ray in model-view space.
  • Each test triangle is forward transformed by its relative transformation before testing.
  • The ray and triangle are tested via something like Möller-Trumbore.
There are a few things to note about this original approach:
  • We need the depth of intersection to sort our hits, and we get that from the algorithm for free, which is nice.
  • The algorithm forward transforms every triangle (sad) but it also does not have to calculate the inverse matrices of any of the sub-transforms of our model. Since LDraw sub-transforms can be pretty much anything in a 4x3 matrix, not having to run a real 'inverse' is nice.
  • Because we aren't doing plane-based intersections or using matrix inverses, there is no need to cache any data to make the intersection tests faster.  Since BrickSmith is an editor, not having to precompute 'helper' data for our model  (that would need to be recalculated every time the user edits) is a win.
  • There is no culling or early out; a hit test looks at pretty much the entire model, which is not fast.
BrickSmith never shipped with a model-view-based approach to the marquee selection (which selects every triangle intersecting an axis-aligned bounding box (AABB) in screen space.

Had we stayed with a modelview-space approach, I believe the marquee selection would have been implemented by taking the intersection of all triangles on the positive side of six planes: four planes going through the camera position and the selection box, as well as the near and far clip planes.  In other words, the marquee selection forms a tighter-frustum in which we want to find intersections.

Had we had such a frustum test, it would have had two other uses:
  • Accelerating rendering (by discarding any rendering that was fully outside the 'big' frustum that is the viewport) and
  • If the frustum test is faster than our ray-triangle hit test (above), we can accelerate single-point testing by first culling the entire scene to a frustum that 'barely' surrounds our hit ray (e.g. via a marquee selection of just a few pixels).
Since there was no marquee selection or fast frustum test in the original code, there was no utility to accelerate point testing or drawing.  Mousing over very large models could be very slow (because the real-time coordinates-under-mouse display has to do a hit test to determine the depth under the cursor) and rendering very large models was slow even if they were mostly off screen.

Screen Space

When I looked at getting a frustum test going (to implement marquee selection), I came up with an idea that I thought at the time would be very clever: work entirely in screen space.  Working in screen space creates a more fundamentally limited API (because all queries must be specified in screen space points and AABBs, whereas the old code could test any ray in any direction) but BrickSmith was only using screen-space tests, and they can potentially be done quite a bit quicker.  Here's how the new code worked:
  • Point queries and AABB queries are always provided in normalized device coordinates (with returned depths also in NDC from -1 to 1).
  • To test triangles, we forward transform them all the way through their complete matrix (model view and projection) and then do a perspective divide.
  • The resulting primitive is 2-d in its extent and can be tested very cheaply.  In particular, the AABB test is dirt cheap in 2-d.
We now have our frustum test and can thus rapidly reject things that are either off screen or far from our click point.  To make things even faster, the code tests the AABB of containers of parts first (by trnasforming the AABB from its local model space to screen space, building a new AABB around it, and testing that).  This hierarchical culling of bounding boxes is what provides the real speed.  Given an MPD model with most of the sub-models off screen, we pay almost nothing for them.

One more win: because the frustum surrounding a single mouse-click test is so tiny, almost everything in a model is rejected early when testing for the mouse, so mouse-over-the-model is fast even when a huge model is entirely on screen.


At this point I was feeling pretty clever and smug (because for a really big model, the hierarchical culling could give you a 10x or 50x speedup) and happy (that my models were usable again).  But of course if you have been working with 3-d graphics for a while, you're probably face-palming yourself at the big leak in the abstraction of working in clip space.

Here's the thing: BrickSmith normally keeps your entire model behind the near clip plane.  It's like your model is sitting on a table and you are hovering over it; you never actually jam your face into the model.  So the fact that I was working in screen space was not a problem because the entire model always had a sane representation in clip space.

Until it didn't.  I implemented a "walk-through" camera that would let the user get inside their model and look around.  For users making modular buildings, this leads to some pretty cool screen-shots.  But it also led to low framerate and (even weirder) whole parts of the model just disappearing randomly!  I looked at the math over and over and there just wasn't something broken.

Of course what was broken was not a math mistake, it was a lack of clipping.  The problem is that in screen space, geometry in front of the near clip plane is eliminated by the GPU; therefore we don't have to worry about the strange things that happen in front of the near clip plane.  But in my hit-testing code, those vertices were being processed as is.

After a standard glFrustum transformation, vertices in clip space (homogeneous coordinates) have their -Z in their W coordinate which will become a divisor to the final device-space coordinate.  X and Y have been multiplied by the near clip plane.  So:
  • At the near clip plane, geometry does not change size.  In other words, you have a simple 2-d coordinate system.
  • Behind the near clip plane, things get smaller as they get farther away - this is not astonishing.
  • In front of the near clip plane, things get really big, really fast as we divide by a fractional Z.
  • At Z = 0 (parallel to the camera) we get a divide-by-zero, which is exciting.
  • Behind the camera, the divisor is negative (Z is positive, but the W coordinate has been negated already to get the rest of rendering to be not insane), so everything is mirrored in X and Y!
So to review, we have INFs floating around our coordinates, some triangle vertices may have been mirrored around the origin, and some geometry might be scaled to be "freaking huge" if it's between the near clip plane and 0 in camera coordinates.

The disappearing models happen when we have a model-view-space AABB that is to the left side of the screen in front of the camera and to the right behind the camera (because it is at a 45 degree angle to the eye-space Z axis, for example); in this case the "behind us" part of the AABB flips, and the whole AABB is to the left side of the screen in screen space and gets culled.

(This case happens a lot when we have long, thin models that we are in the middle of, viewed at a 45 degree angle to their long axis.   In other words, walk around a modular city and the city is going to disappear on a regular basis as you pan the camera.)

The fix I found and implemented is stupid but effective: clip all transformed triangles in clip space before the perspective divide.
  • The AABB is treated as a cube and each of its 12 edges are clipped to produce a new point set.  Since the AABB was going to have a new AABB built around its 12 points, doing this with a clipped point set isn't much different.
  • Triangles are clipped, sometimes resulting in zero or two triangles, before the perspective divide.
Besides fixing mouse testing, this clipping improves rendering speed in walk-through views significantly: geometry that is fully behind the near clip plane is completely removed from consideration, even if its projection would be on-screen; similarly an AABB that is in front of us but off screen to the left and behind us directly is clipped and the AABB tightened to not contain a 'false intersection' in front of us.

Other Options

There were a few other options I considered but haven't had time to code.

One option would be to do what X-Plane does (for culling) and work in eye-space, using side-of-plane tests for our frustum culling.  The test (for points) is cheap and would work without clipping.

(X-Plane does its ray-triangle mouse-testing in the coordinate system of the model itself; because all of the local transforms are simple rotations or translations, applying the inverse to the test ray is really cheap, and it avoids having to transform the entire model.)

Another option would be to work in homogeneous coordinates.  I am still not sure if this is mathematically possible, but the idea would be to do the ray-point intersection in clip space, and then check whether the clip-space intersection is behind the near clip plane.  This would let us do a single ray-triangle intersection without special-casing out the clipping logic, then reject a single point later.

Saturday, August 31, 2013

Theoretical Engineering: Occlusion Culling for BrickSmith

I have been pushing for a formal level-of-detail scheme for LDraw; the performance problem is that LDraw bricks contain a fairly high vertex count; at low zoom a large model might be entirely on screen, resulting in frame-rate limited by pure vertex count.  (The vertices are on the GPU, the batches are drawn with array-attribute-divisor instancing, and the destination viewport is small with simple shaders, so the limiting factor is trying to cram 125M vertices through the GPU.)

During the discussion, some developers brought up the question of occlusion culling.  In many of these wide views, a fair amount of the geometry is going to have no contribution in the final image due to occlusion by other, closer bricks.

Occlusion culling is tricky stuff. Occlusion queries on the GPU are not available to the CPU until drawing completes; recent GPUs are finally gaining the ability to use the occlusion query result directly, eliminating back-to-the-CPU latency.  CPU-side occlusion queries are usually facilitated by heavy pre-processing of the source data, but that isn't an easy option in BrickSmith because it is an editor, not a viewer (and therefore the source data is continually changing).

What follows is speculative engineering, e.g. how could occlusion culling work on a modern GPU. Looking at the requirements for the design we can then infer as to whether such a scheme is feasible. (Spoiler alert: it is not.)

When looking at the algorithm that follows, please keep one thing in mind: the limiting factor here is vertex count - that's what we are trying to optimize.  This is a very unusual situation in 3-d graphics; most systems are designed to not have this problem. In particular, we are not fill rate bound, so we can expect GPU commands involving small numbers of vertices to complete reasonably quickly, because the GPU isn't drowning in last-frames pixel-fill operations.

Here's the basic idea:
  • Every part that we draw is divided into two sets of geometry: the crude occluders and the rest of the part.  The crude occluders are the non-transparent triangles and quads (no lines) that are big enough to be worth drawing early to start filling space.  The goal is to keep fill rate high while lowering vertex count, so that we can fill a lot of the screen quickly and then see what is left. (For example, for a 2x4 brick, the outer 5 quads of the brick top and sides would be the crude occluders; the entire interior, edging, studs and tubes and line work would be the rest.)
  • Each brick is pre-'normalized' so that it exists within a unit cube; when we draw our brick instances, the transform matrix will include scaling factors to reconstitute the brick, not just the translation and rotation we have now.  (This means that we can know from the instance data where a brick will be.)
  • We set up an off-screen render target so that we can get direct access to the depth buffer; we'll blit the whole thing down to screen later, and it'll be cheap (one single-pass screen-sized blit with no processing).
  • We start drawing the way the engine works now: we traverse the data model and dynamically build instancing lists for all opaque parts.  (Translucent parts get dumped in a holding bay for later, slower, sorted processing; they won't participate in occlusion.)  This step exists in the current BrickSmith source code and is fast enough that CPU is not the bottleneck even on my older 2008 Mac Pro.
  • We draw the instance list with the crude occluder meshes. This should, in theory, complete quickly, because the vertex count is low, early Z can take effect, there is no blending, and our shaders are not complicated.
So far what we have basically done is rapidly put down a pre-Z style pass with some of our geometry. If our only goal was to limit fill rate, this would be a pretty good setup.  But we need to eliminate actual drawing calls.

Our next step is to grab the depth buffer and use render-to-texture to ping-pong it, forming a hierarchy of depth buffers.  For each depth buffer we will use a "farthest" policy, so that a low-res depth texel contains the farthest depth of the four high-res depth texels that it covers.

Now we can run GPU culling.  For each instance list VBO, we run it through a geometry shader, treating one instance as one vertex.  (In BrickSmith, instances are a 4x3 affine transform plus two RGBA colors, so our 20 floats can be five attributes in the vertex stream.)  The geometry shader calculates the screen space AABB of the instance, fetches depth from the appropriate level of the Z buffer, and determines whether the brick's bounding box is clearly behind every single pixel already drawn into the AABB's location.  The geometry shader then outputs zero or one vertices depending on the cull.

The geometry shader must be fed into a transform feedback stage so that we can (1) capture the vertices back rather than render them and (2) so the GPU will retain the number of vertices for later use in a glDrawTransformFeedbackStreamInstanced call.

(We could implement frustum culling here too, but I think we need to keep the original CPU frustum cull from the initial draw in order to not draw the entire model off-screen when we are setting up our early-Z buffer.)

Note that BrickSmith currently uses one giant VBO for its instance list - the individual bricks know which part of the instance list they own, so we save on VBO count.  However, if we want to reduce the instance count we have a serious problem: our segment buffering will be borked.  One way to solve this is to transform each primitive into its own transform feedback VBO, so that we can know how many primitives to draw and where they start.  This would probably be bad for total draw performance.

(There might be a better way to organize this, perhaps with multi-draw-indirect or compute, or atomic writes to a memory buffer.  I have not dug into this because, as you will see from later in the post, this scheme is not actually shippable.)

Finally, with the instance lists culled, we can go and draw 'the rest' of the model; since the instance lists have been trimmed, we skip most of the geometry for every occluded brick.  Thus we will get a large vertex savings on every occluded brick.  (For example, a 2x4 brick will save better than 97% of its vertices, even after excluding lines, when occlusion culled.)

There are a few reasons why we can't actually ship this.
  • The biggest single problem that I see is that the part library would have to be pre-tagged into rough occluder geometry and the rest.  Given the size of the part library and its structure as a series of "subroutine"-style partial primitives, such a partitioning would both be a huge manual undertaking (that could in theory be automated) and it would result in a transformation of the parts that could not be fed back into the LDraw system.
  • The code above assumes the most modern GPU features.  Instanced transform feedback is an OpenGL 4.2 feature, so it would require a DX11-class GPU.  Apple has not yet shipped OpenGL 4.0, so at best we'd be dropping support for Lion and Mountain Lion.
  • Using separate VBOs for each brick (to get the cull list) would be a serious performance problem.  It might be possible to overcome this with other new OpenGL techniques, but the requirement on a new driver would remain. GPU-culling could also be performed only on parts for which there was a large part count, applying an 80-20 rule to culling.
  • Finally, the technique above is very complicated - it is probably more implementation work than the entire renderer-rewrite was, and BrickSmith is not a commercial project; those of us who poke at it do so in our very, very limited spare time.  I don't thin anyone involved in the project has time for such serious GPU work.
Edit: reading Daniel Rákos' article more carefully, I see he uses an asynchronous query to get the amount of feedback delivered.  In theory this would let us source our instances out of a single VBO even after culling, which would be a performance win.  However, I am not sure that this would be completely free. The algorithm without this modification can run entirely asynchronously - that is, the entire frame can be "late" and still render.  Pulling back the instance count to the CPU therefore might introduce the first sync point and hurt performance.  (E.g. if the previous frame hasn't completed, the next frame's query for culling count will block because the GPU wans't available.)

The conclusion is, of course, just to use crude reduced-LOD models for far viewing and go home happy.  But having spent the time considering what it would take to do GPU culling, I figured I should at least write it down.

Sunday, August 25, 2013

What Does gpus_ReturnGuiltyForHardwareRestart Mean?

First, a bunch of disclaimers: I do not know how the PowerVR SGX hardware really works, I do not have access to either the Apple OpenGL source or the PowerVR SGX driver source, and therefore what follows is all speculative engineering (read: guess work).  So if what follows is wrong, you've been warned.

With that out of the way, you're sitting in your office sipping iced coffee and coding OpenGL ES 2.0 subroutines for IOS.  You test on the device and get an access violation with this back-trace.
frame #0: 0x31e23916 libGPUSupportMercury.dylib`gpus_ReturnGuiltyForHardwareRestart + 10 frame #1: 0x31e24398 libGPUSupportMercury.dylib`gpusSubmitDataBuffers + 104 frame #2: 0x2c38888c IMGSGX543GLDriver`SubmitPackets + 124 frame #3: 0x2f74d6be GLEngine`gleCleanupOrphans + 130 frame #4: 0x2f72151e GLEngine`glBufferData_Exec + 254 frame #5: 0x010be8e6 libglInterpose.dylib`buffer_data(__GLIContextRec*, unsigned int, long, void const*, unsigned int) + 158 frame #6: 0x2f7b1c26 OpenGLES`glBufferData + 38
The question is: WTF?  What does this mean?  You look at your call to glBufferData and it looks (1) totally sane and (2) seems to have exploded on the 405th invocation.  Why did it just stop working now?

From what I can tell, you get an intentional crash in gpus_ReturnGuiltyForHardwareRestart when something goes wrong on the GPU itself, asynchronously, that the driver failed (or didn't bother) to catch.

For example, if you are calling glDrawElements with VBOs bound for both the element array and all vertex buffer sources, then the draw call will be hardware accelerated*: the driver will write some commands to the command buffer to draw from addresses in the GPU address space (which I guess is just system memory for an iPhone) and the GPU will later read the command and start fetching elements.

If you have screwed up and requested an element index that is out of bounds for the vertex arrays, the driver won't notice, because it is not copying your vertex data (and fortunately isn't wasting time bounds-checking your index buffer).  Instead the GPU will eventually start fetching vertices directly, and when it notices that one of the vertex fetches went out of bounds, it's going to make a note for the driver to fetch later.

So to sum up so far: you do a hardware-accelerated glDrawElements with bad index data (or the wrong VBO bound, or any other way to get out-of-bounds vertex fetches) and some time later the GPU gets around to executing your command (which has not been pre-checked), notes that it went out of bounds, and leaves a note for the driver to pick up.

So now we can look at why we blew up in glBufferData, a seemingly unrelated call.  glBufferData called into GLEngine (which is Apple's common OpenGL engine), which eventually had to talk to the specific PowerVR SGX driver.  (Apple's OpenGL stack is made of a common front-end that Apple produces, and back-end drivers that talk to specific hardware.)  The SGX driver at that point goes to talk to the hardware and discovers that since its last check, something really bad happened (E.g. we went out of bounds in our draw call).  The SGX driver then calls Apple back, calling gpus_ReturnGuiltyForHardwareRestart, which I guess is Apple's way to have a GPU vendor's driver tell them that the GPU itself seg faulted.

What makes these crashes exciting is their synchronous nature: the call that you get the crash in is (1) not the call that offended OpenGL and (2) affected by timing on both the CPU, and GPU (because the crash comes in the first CPU call to talk to the GPU after the GPU detects the problem, which happens whenever it has time to get to your draw call).  So the normal technique (comment things out and see what changes) just moves the crash around.

Based on this speculation, the way I fixed the problem was: I wrote a (very slow) debug routine to check the indices of all glDrawElements and glDrawArrays calls, mapping back the buffers as needed, and asserting that things are sane.  This immediately caught our real problem: a client-array draw call was failing to unbind VBOs - by luck the call before had changed to leave a VBO bound.  The client-array call was now drawing out of VBO memory, not its own, and since the VBO was smaller than the client array being specified, the draw call would run off the end of the VBO.

Because we have macros that map all glDrawElements and glDrawArrays calls to our own routines (that then call the real glDrawElements and glDrawArrays calls) adding this error checking was quite trivial; why we have those macros will have to be another blog post.

* Well, maybe it will be hardware accelerated.  If you have poorly aligned vertex data or use a wonky format, you might fall back to software vertex transfer.  This is fairly easily spotted in Instruments because your draw call will have routines with IMM below them in the trace, and a lot of CPU time will be spent immediately copying your vertices.  Accelerated draw calls themselves take almost no time to execute other than time to update GL state (which is also usually visible in Instruments).

Monday, July 22, 2013

The Smell of Victory

You might think the primary purpose of X-Plane is to help people enjoy the experience of flight at home, or to help pilots train for real-world situations.  You would be wrong.

X-Plane's real purpose is to slowly drive the last traces of sanity out of OpenGL driver-writer's already deep-fried minds.  X-Plane is joined in this noble calling by a legion of other sometimes-badly-behaved legacy applications that are shot through with conditional logic based on random OS and driver versions.

And you know...I think we're winning.  From comment 9 of GL_ARB_buffer_storage (just released as part of Core OpenGL 4.4):
  9) What is the meaning of CLIENT_STORAGE_BIT? Is it one of those
       silly hint things?

       DISCUSSION: Unfortunately, yes, it is. For some platforms, such as UMA
       systems, it's irrelevant and all memory is both server and client
       accessible. The issue is, that on some platforms and for certain
       combinations of flags, there may be multiple regions of memory that
       can satisfy the request (visible to both server and client and coherent
       to both, for example), but may have substantially different performance
       characteristics for access from either. This bit essentially serves
       as a hint to say that that an application will access the store more
       frequently from the client than from the server. In practice,
       applications will still get it wrong (like setting it all the time or
       never setting it at all, for example), implementations will still have
       to second guess applications and end up full of heuristics to figure out
       where to put data and gobs of code to move things around based on what
       applications do, and eventually it'll make no difference whether
       applications set it or not. But hey, we tried.
I would like to suggest to the ARB that, in the future, they name the hints things like GL_REALLY_FAST_BIT and GL_NO_REALLY_THIS_BUFFER_NEEDS_TO_BE_FAST_BIT.  You guys are going to ignore the hint bits anyway; the hint bits might as well directly express what we are trying to code.

Friday, May 10, 2013

Moving the Needle - a Quick Audit of the 7970

I dropped X-Plane into ATIAMD's GPU PerfStudio 2 to see what might be going on on the GPU side of things.  This is on a Sandy Bridge i5 with a 7970 on a 1920 x 1200 monitor.  I cranked the sim up to 4x SSAA (stupidly simplistic anti-aliasing) to really push fill rate.



We're mostly CPU bound.  Still.

You know those GDC talks where the first thing the IHVs say is "most games are CPU bound?"  I can feel them glaring in my direction.

There was one case where I was able to absolutely hose the GPU: a full screen of thick particle-type clouds at 1080p + with 4x SSAA.  You can tell you're GPU bound just by the amount of heat the card blows out.

What was cool though, was running a batch profile in GPU PerfStudio and sorting by GPU time.  At a whopping, facemelting, eye bleeding 32 ms was a single batch of cloud puffs (about 10k vertices) that covered most of the screen.  The histogram falls off from there, with the next few most expensive cloud batches taking a few ms and everything else being noise.

This isn't hugely surprising...we know our cloud system is a fill-rate pig and thus responds badly to very large render surfaces...the real surprise is how the GTX 680 copes with them so well.  (What is less obvious is what to do about it; I fear the halo-type artifacts that a lot of half-res solutions provide may be a deal-breaker; re-rendering the edges with stencil will increase our geometry count and we do have a lot of puffs.  Probably the right thing is to start using hardware MSAA surfaces for the G-Buffer to leverage hardware compression.)

I went looking for anything else that might be slow and finally got an answer about pre-stenciling lighting volumes.  Our first deferred rendering implementation pre-stenciled lighting volumes to cut fill rate; we dropped the stencil pass when we found that we were CPU and AGP bandwidth bound; the only time losing stencil was a problem was in high-zoom lighting scenarios.

With the GPU profiler, I could see that a moderate sized batch of light volumes around the Apron at our Seattle demo airport takes about 1.5 ms to render at the aforementioned rather large resolution.  The scene has maybe 3 or 4 volumes of that magnitude, and the rest are small enough in screen space that we don't need to care.

That only adds up to 6-10 ms of GPU time though - and given that the sun pass is fast enough to not show up in the top 10 list, FXAA is fast and even scenery fill isn't so bad, it's clear why light fill isn't the long pole, particularly when the CPU is struggling to get around the frame in 30 ms.  Cutting CPU work does appear to be the right thing here.

The real right thing, some day, in the future, when I have, like, spare time, would be to do two scene graph passes: the first one would draw lights except in the near N meters, with no stencil; the second pass would only grab the near lights and would do two passes, stenciling out the lights first.  This would give us the fill rate fix in the one case where it matters: when the light is close enough to be huge in screen space.  (Our non-sun lights tend to be reasonably small in world space - they are things like street lights, apron lights, and airplane landing lights.)

There is one limitation to GPU PerfStudio2 that frustrated me: because it doesn't sniff the call-stack during draw calls, it can't easily do data mining for the source of poor performance.  That is, if you have a big app that generates a huge frame using a number of common subsystems, if one subsystem sucks, it doesn't data mine that for you.  (Note: I did not experiment with trying to inject frame markers into the draw call stream...I don't even know if they support that using the KHR debug extension.)

My next step will be to integrate the NV and ATI performance counter APIs directly into the sim.  We have, at various times, had various timing utilities to allow us to rapidly instrument a code section in a way that follows the logical design, rather than just the call stack.  (Shark was so good that we didn't use any utilities for a while.)  With the GPU performance counters, we could potentially build HUD-style GPU metering directly into the app.

Saturday, April 20, 2013

There Must Be 50 Ways to Draw Your Streaming Quads

So you want to draw a series of streaming 2-d quads.  (By streaming I mean: their position or orientation is changing per frame.)  Maybe it's your UI, maybe it is text, maybe it is some kind of 2-d game or particle system.  What is the fastest way to draw them?

For a desktop, one legitimate response might be "who cares"?  Desktops are so fast now that it would be worth profiling naive code first to make sure optimization is warranted.  For the iPhone, you probably still care.

Batch It Up

The first thing you'll need to do is to put all of your geometry in a single VBO, and reference only a single texture (e.g. a texture atlas). This is mandatory for any kind of good performance.  The cost of changing the source buffer or texture (in terms of CPU time spent in the driver) is quite a bit larger than the time it takes to actually draw a single quad.  If you can't draw in bulk, stop now - there's no point in optimizing anything else.

(As a side note: you'll eat the cost of changing VBOs even if you simply change the base address pointer within the same VBO.  Every time you muck around with the vertex pointers or vertex formats, the driver has to re-validate that what you are doing isn't going to blow up the GPU.  It's incredible how many cases the driver has to check to make sure that a subsequent draw call is reasonable.  glVertexPointer is way more expensive than it looks!)

The Naive Approach

The naive approach is to use the OpenGL transform stack (e.g. glRotate, glTranslate, etc.) to position each quad, then draw it.  Under the hood, the OpenGL transform stack translates into "uniform" state change - that is, changing state that is considered invariant during a draw call but changes between draw calls.  (If you are using the GL 2.0 core profile, you would code glRotate/glTranslate yourself by maintaining a current matrix and changing a uniform.)

When drawing a lot of stuff, uniforms are your friend; because they are known to be uniform for a single draw call, the driver can put them in memory on the GPU where they are quick to access over a huge number of triangles or quads.  But when drawing a very small amount of geometry, the cost of changing the uniforms (more driver calls, more CPU time) begins to outweigh the benefit of having the GPU "do the math".

In particular, if each quad has its own matrix stack in 2-d, you are saving 24 MADs per quad by requiring the driver to rebuild the current uniform state.  (How much does that cost?  A lot more than 24 MADs.)  Even ignoring the uniforms, the fact that a uniform changed means each draw call can only draw 1 quad.  Not fast.

Stream the Geomery

One simple option is to throw out hardware transform on the GPU and simply transform the vertices on the CPU before "pushing" them to the GPU.  Since the geometry of the quads are changing per frame, you were going to have to send them to the GPU anyway.  This technique has a few advantages and disadvantages.
  • Win: You get all of your drawing in a single OpenGL draw call with a single VBO.  So your driver time is going to be low and you're talking to the hardware efficiency.
  • Win: This requires no newer GL 3.x/4.x kung fu.  That's good if you're using OpenGL 2.0 ES on an iPhone, for example.
  • Fail: You have to push every vertex every frame. That costs CPU and (on desktops) bus bandwidth.
  • Not-total-fail: Once you commit to pushing everything every frame, the cost of varying UV maps in real-time has no penalty; and there isn't a bus to jam up on a mobile device.
Note that if we were using naive transforms, we'd still have to "push" a 16-float uniform matrix to the card (plus a ton of overhead that goes with it), so 16 floats of 2-d coordinates plus texture is a wash.  As a general rule I would say that if you are using uniforms to transform single primitives, try using the CPU instead.

Stupid OpenGL Tricks

If you are on a desktop with a modern driver, you can in theory leverage the compute power of the GPU, cut down your bandwidth, and still avoid uniform-CPU-misery.

Disclaimer: while we use instancing heavily in X-Plane, I have not tried this technique for 2-d quads.  Per the first section, in X-Plane desktop we don't have any cases where we care enough.  The streaming case was important for iPhone.

To cut down the amount of streamed data:
  • Set the GPU up for vertex-array-divisor-style instancing.
  • In your instance array, push the transform data.  You might have an opportunity for compression here; for example, if all of your transforms are translate+2-d rotate (no scaling ever), you can pass a pair of 2-d offsets and the sin/cos of the rotation and let the shader apply the math ad-hoc, rather than using a full 4x4 matrix.  If your UV coordinates change per quad, you'll need to pass some mix of UV translations/scales.  (Again, if there is a regularity to your data you can save instance space.)
  • The mesh itself is a simple 4-vertex quad in a static VBO.
You issue a single instanced draw call and the card "muxes" together the instancing transforms with the vertex data.  You get a single batch, a small amount of data transferred over the bus, and low CPU use.

There are a few other ways to cut this up that might not be as good as hw instancing:
  • There are other ways to instance using the primitive ID and UBOs or TBOs - YMMV.
  • If you have no instancing, you can use immediate mode to push the transforms and make individual draw calls.  This case will probably outperform uniforms, but probably not outperform streaming and CPU transform.
  • You could use geometry shaders, but um, don't.