Thursday, December 13, 2012

Static Libraries and Plugins: Global Pain

Years after the X-Plane plugin SDK was ported to operating systems that support Unix .a (archive) static libraries, I have finally come to understand what a mess global symbols can make with incorrect linker settings. The problem is that the incorrect linker settings are almost always the defaults. This blog post will explain what goes wrong with this kind of linker setup and how to fix it.  While this stuff might be obvious to those intimately familiar with Linux and Unix-style linking, it's a bit astonishing to anyone coming from the Windows and pre-OS X Mac world, where the assumptions about linkage are very different.

This post may also thoroughly slander Linux, and if I learn why I'm an idiot and the whole problem can be solved in a much better way, hey, that's great.  I'd much rather find out that there's a better way and I'm wrong than find out that things really are as broken as they seem.

Global Symbols and Shared Libraries

Unix-style linkers (e.g. ld on both OS X and Linux) support a shared global namespace for symbols exported from shared libraries. Simply put, for any given symbol name, there can be only one 'real' implementation of that symbol, and the first dynamic library (or host app with dynamic linkage, which is basically all host apps these days) to introduce that symbol defines it for every dynamic library.

In other words, if you have five implementations of "void a()" in your dynamic libraries, the first one loaded is used by everyone.  It's a global namespace.

Note that if your symbol is not global, it will not be replaced by an earlier variant; other people's global symbols can't hose you.

The implications of this are clear: you should be very very careful and very very minimal about what gets exported into the global namespace, because of the risk of symbol collision.  I found a bug in an X-Plane plugin because the internal routine sasl_done (in a plugin called sasl) was global and was the second instance loaded - sasl_done from libsasl2.dylib had already been loaded by the OS.  The result: a random call into a DLL when the plugin thought it was calling itself!

Unfortunately, the default for GCC is to put everything into the global namespace.  As gcc 3.x fades into history, more code is using -fvisibility=hidden and attributes more aggressively, but the defaults make it really easy to do the wrong thing and dump a whole lot of symbols into the flat namespace.

There is one exception to this global resolution: if you use dlsym to resolve a symbol against a specific dynamic library handle (as returned by dlopen), the symbol is found in that dynamic library, as you would expect.  Therefore if you have a plugin with an "official" entry point (like "PluginStart") you can load multiple plugins into the global namespace and find the "right" start function via dlsym.  (If a plugin called its own start routine directly, it might jump into the wrong plugin due to global namespace issues.)

What Am I Exporting?

On both OS X and Linux you can use "nm" to view your globally exported symbols:
nm my_plugin.dylib | grep "T "
The symbols marked with a capital T are code symbols in the global namespace.  If you make a plugin DLL that exports a lot of those, your code may not operate correctly if there are other plugins already loaded.

Static Libraries: Not So Static

In the Unix world, the .a (static archive) format is basically a collection of .o files with some header info to optimize when the code is linked.  .o files retain the hidden/visible attribute information that is used by the linker to export symbols out of a dynamic library.

What this means is: under normal operation, the linker may export dynamic library symbols out of a static library you link against. In other words, if you link against libpng.a, you may end up having your DLL export all of the symbols of libpng!  If you aren't the first dynamic library to load, the version of libpng you get may not be the static one you asked for.

This behavior is astonishing at best, but unfortunately it is, again, the default: if the static library didn't specifically set its symbols to hidden, you get "leakage" of static library symbols out of the client shared library.  Unfortunately, from my experience this kind of leakage happens all of the time.  With X-Plane we statically link libcurl, libfreetype and libpng, and all three have their symbols marked global by default.  These are ./configure based libraries and we don't want to start second-guessing their build decisions.  Unfortunately the code tends to be marked up to build the right API in "shared library" mode but not in static mode.

You can see this behavior using nm -m on OS X or objdump -t on Linux.

Working Around Library Leak

Someday we may reach a point where all Unix static libraries keep their symbols "hidden" for dynamic library purposes, but until then there is something a DLL can do to work around this problem: use an explicit list of symbol exports.

Using an explicit list of symbol exports is often considered annoying when an API has a large set of public entry points; usually attributes marking specific functions are preferred.  The advantage of an "official list" at link time is that the linker hides everything except that list, and if any static libraries have globally visible symbols, their absence from the master list fixes the problem.
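As a sketch of the OS X flavor (the entry-point names are illustrative, not the actual SDK's): you write the list to a file and hand it to the linker, which hides everything else.

```
# exports.txt - the only symbols that stay global.
# Note the leading underscore the OS X linker adds to C symbols.
_PluginStart
_PluginStop
```

The link line then gains something like -Wl,-exported_symbols_list,exports.txt; on Linux the equivalent mechanism is a version script passed with -Wl,--version-script that marks everything else local.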

(As an example of how to set this up for gcc on Linux and OS X, see here.)

Addendum: What About Namespacing?

Both Linux and OS X have over time developed ways to cope with the flat namespace problem.

On OS X, a dynamic library can be linked with a two-level namespace.  The symbol is resolved against both the name of the providing dylib and the symbol itself. The result is that symbols come only from the dylibs where you thought they would come from.  If at link time symbol A comes from library X, library X is the only place where it will be provided in the future.  (This is the semantics Windows developers are used to.)

On Linux, library APIs can contain version information; as far as I can tell this works by "decorating" symbols with a named library version (e.g. @@GLIBC_2.0).  When the ABI is changed, symbols cannot conflict between versions, and in theory this may also protect against cross-talk between libraries since the version symbol has some kind of short universal library identifier.  I have found almost no documentation on library versioning; if anyone has a good Linux link I'll add it to this post.
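From what I can piece together (treat this as an assumption rather than gospel), the decoration comes from version nodes in a GNU ld version script; a library built with something like the hypothetical script below ends up with symbols named foo_init@@LIBFOO_1.0 in objdump -T output:

```
/* libfoo.map - hypothetical versioned ABI, passed via -Wl,--version-script */
LIBFOO_1.0 {
  global:
    foo_init;
    foo_compute;
  local:
    *;      /* everything else stays out of the dynamic symbol table */
};
```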

X-Plane's plugin system does not use either of these mechanisms because the plugin system is older than both of them. (Technically on OS X two-level namespaces are older than the plugin system, but the plugin system is older than @loader_path, which is a requirement for strict linking of a dylib in the SDK.)  Thus we are stuck with the global namespace and find ourselves trying to force people to keep their symbols to themselves.

Sunday, December 09, 2012

Integrating LuaJIT with X-Plane: 64-bit fun

X-Plane supports a plugin SDK; plugins are dynamic libraries loaded by the sim into its process address space.  X-Plane plugins link against a host library for access to the sim via a controlled API.

A number of popular plugins (Gizmo, SASL, FlyLua) provide Lua scripting to X-Plane add-ons using LuaJIT 2.0 as their Lua runtime engine.

LuaJIT on an x86_64 machine has an odd quirk: it requires all of Lua's allocations to be in the first 2 GB of address space.  This requirement comes from its use of offset addresses, which apparently have a signed 32-bit range.

Normally on OS X 64-bit the only special work you have to do is customize the zero page region, as OS X defaults the zero page to the entire bottom 4 GB of memory.  (I can only speculate that having the entire 32-bit address range be off limits helps catch programming errors where pointers are truncated to 32-bits because any truncated pointer will point to an illegal page.)

With X-Plane, however, we hit a snag: Lua scripts are often loaded by plugins late in the program's operation, when the user changes airplanes.  By this time, the OS has already consumed all of the "Lua-friendly" address space below 2 GB.  Looking at vmmap dumps, it looks like both the system malloc and OpenGL driver stack like this region of memory.

Here's how we 'fixed' this for 64-bit X-Plane:
  • The host sim pre-allocates as much of the 2 GB region as it can early on, in fixed size chunks.  Currently we're using 32 MB chunks, but this may change.
  • The host provides a custom Lua realloc function that is implemented using a hacked version of dlmalloc; dlmalloc uses the pool of pre-grabbed 32 MB chunks to form its pools.
  • Plugins use lua_newstate to connect the host's "chunk" allocator to the Lua runtime.
This last step required a modification to LuaJIT itself; the shipping 2.0 code has the API for custom allocators disabled in 64-bit builds (probably under the assumption that client code wouldn't follow the "bottom 2 GB" requirement).  Fortunately the stub-out is just a series of #defines and the full API functionality is easily restored.

The 1.6 GB of virtual address space the sim grabs up front doesn't become an actual memory allocation until plugins request the memory via Lua, because OS X is lazy about committing memory. Because the address space is pre-grabbed, VM allocates and mmaps that happen later in the app's life cycle simply go to higher address ranges.

There is one limitation to the implementation as it runs now: allocations larger than 32 MB will fail because the 'direct' allocation API in dlmalloc is not functional.  In theory the host allocator could search for consecutive 32 MB blocks to form larger allocations, but I suspect that in practice we'll simply set the chunk size larger.  Typical X-Plane based Lua scripts don't need huge amounts of memory (or at least it appears this way so far) and the number of separate blocks needed is based on the number of simultaneous Lua plugins running, which is also typically quite low.*

* By quite low I mean only one plugin at a time; as far as I can tell Lua plugins often export the runtime with global external symbols, which causes complete chaos.  That's the subject of another blog post.

Monday, November 19, 2012

Deferred Weirdness: What Have We Learned?

To paraphrase the famous Nike Tiger Woods ad, "...and, did you learn anything?"

X-Plane is more of a platform than a game - we support content that comes from third parties, is not updated in synchronization with the engine itself, is sometimes significantly older than the current engine code, and is typically created by authors who have limited or no access to the engine developers.

The easiest, quickest, most efficient way to make deferred rendering work is to set the rules first and customize the content creation process around them.  X-Plane's platform status makes this impossible, and most of the nuttiness in our deferred pipeline comes from bending the new engine to work with old content.

We have a number of design problems that are all "difficult" in a deferred engine:
  • A wide Z buffer range means we need to render twice.
  • Content is often permeated with alpha but still needs to be part of a deferred render.
  • Some content must be linearly blended, some must be sRGB blended, and there isn't any wiggle room.
Coping with any one of these problems is quite doable within a deferred renderer, but coping with all three at once becomes quite chaotic.  Having to maintain two blending equations (sRGB and linear) is particularly painful.

(This sits on top of the typical problems of being geometry bound on solid stuff and fill-rate bound on particles.)

So if there's a "learn from our fail"* in this, it might be: try to limit your design problem to only one "rough edge" in the deferred renderer if possible - it's not hard to put a few hacks into a deferred renderer but keeping several of them going at once is juggling.

* I should say: I'm just being snarky with "learn from our fail" - I consider our deferred rendering solution a success in that it provides our users with the benefits of deferred rendering (huge numbers of lights, better geometric complexity, soft particles, and there are other techniques we're just starting to experiment with) while supporting legacy content with very little modification.

The code is under stress because it does a lot, and while it's cute for me as a graphics programmer to go "oh my life would be easier if I could ignore alpha, color space, and go drink margaritas on the beach", the truth is that the code adds a lot of value to the product by providing a lot of features at once.

I just don't want to have to touch it ever again. :-)

Sunday, November 18, 2012

Deferred Weirdness: When Not to be Linear

In my previous post, I described X-Plane 10's multipass scheme for deferred rendering.  Because we need to simultaneously operate in two depth domains with both deferred and forward rendered elements, the multi-pass approach gets very complicated, and much stenciling ensues.

But the real wrench in the machinery is sRGB blending.  It was such a pain in the ass to get right that I can safely say it wouldn't be in the product if there was any way to cheat around it with art.

Color Space Blending

First: when I say "sRGB" blending, what I mean is: blending two colors in the sRGB color space.  sRGB is a vaguely perceptual color space, meaning equi-distant numeric pixel values appear about equi-distant in brightness - if you saw a color ramp, it would look "even" to you.  It's not!  The sRGB stripe looks even to humans because we have increased visual sensitivity in the darker brightness ranges. (If we didn't, how could we stumble for the light switch and trip over the dog at night?)

In the sRGB world you mix red (255,0,0) and green (0,255,0) and get "puke" (128,128,0).

Linear color spaces refer to color spaces where the amount of photons or physical-whatever-thingies are linear.  Double the RGB and you get twice as many photons.  Linear spaces do not look "linear" to humans, see above about dark and dogs.

In linear space if you mix red (255,0,0) and green (0,255,0) you get yellow (186,186,0 if we translate back to sRGB space).

When to Be Linear

There are two cases where we found we just absolutely had to be linear:
  1. When accumulating light from deferred lights, the addition must be linear.  Linear accumulated lights look realistic, sRGB accumulated lights look way too stark with no ambiance.
  2. When accumulating light billboards, it turns out that linear also looks correct, and sRGB additive blending screws up the halos drawn into the billboards.
In both cases, linear addition of light gives us something really important: adding more light doesn't double the perceived brightness.  (Try it: go into a room, turn on one light, then a second.  Does the room's brightness appear to double?  No!)  For billboards, if a bunch of billboards "pile up" having the perceptual brightness curve taper off is a big win.

When to Be sRGB

The above "linear" blending cases are all additive.  Positive numbers in the framebuffer represent light, adding to them "makes more light", and we want to add some kind of linear unit for physically correct lighting.

But for more conventional "blending" (e.g. adding light on top takes away light underneath) linear isn't always good.

Traditional alpha blending is an easy way for artists to create effects that would be too complex to simulate directly on the GPU.  For example, the windows of an airplane are a witches' brew of specularity, reflection, refraction, absorption, and the BRDF changes based on the amount of grime on the various parts of the glass. Or you can just let your artist make a translucent texture and blend it.

But this 'photo composited' style blending that provides such a useful way to simulate effects also requires sRGB blending, not linear blending.  When an 'overlay' texture is blended linearly the results are too bright, and it is hard to tell which layer is "on top".  The effect looks a little bit like variable transparency with brightness, but the amount of the effect depends on the background.  It's not what the art guys want.

Alpha When You Least Expect It

Besides the usual sources of "blended" alpha (blended meaning the background is darkened by the opacity of the foreground) - smoke particles, clouds, glass windows, and light billboards - X-Plane picks up another source of alpha: 3-d objects alpha fade with distance.  This is hugely problematic because it requires the alpha blending you get when you put alpha through the deferred renderer to be not-totally-broken; we can't just exclude all 3-d from the deferred render.

I mention this now because we have to ask the question: what blending mode are we trying to get in the deferred renderer (to the extent that we have any control over such things)?  The answer here is again sRGB blending.

Faking Alpha in a GBuffer

Can we fake alpha in our deferred renderer?  Sort of.  We can choose to blend (or not blend) various layers of the G Buffer by writing different alphas to each fragment data component; we can gain more flexibility by using pre-multiplied alpha, which lets us run half the blending equation in shader (and thus gives us separate coefficients for foreground and background if desired).

Fake alpha in a GBuffer has real problems.  We only get one 'sample' per G-Buffer, so instead of getting the blend of the lighting equation applied to a number of points in space, we get a single lighting equation applied to the blend of all of the properties of that point.  But blending is better than nothing.  We blend our emissive and albedo channels, our specular level, our AO, and our normal.  (Our normal map is Lambert azimuthal equal-area encoded and it blends tolerably if you set your bar low.)

The only channel we don't blend for our blended fragments is eye position Z; even the slightest change to position reconstruction causes shadow maps to alias and fail - hell, shadow maps barely work on a good day.

The G-Buffer blending is all in sRGB - the albedo and emissive layers are 8-bit sRGB encoded.

Adding It All Up

The emission and albedo layers of the G-Buffer must be added in sRGB space.  This is not ideal (because emission layers contain light) but it is necessary.  Consider two layers of polygons being drawn into a G-Buffer.  The bottom is heavy on albedo, the top heavy on emissive texture.  As we "cross-fade" them with alpha, we are actually darkening the albedo and lightening the emission layer - two separate raster ops into two separate images.  This only produces a correct sRGB blend if we know that the two layers will later be added together in sRGB.  In other words:
  blend(albedo_top + emissive_top, albedo_bot + emissive_bot, alpha)
is only equal to
  blend(albedo_top, albedo_bot, alpha) + blend(emissive_top, emissive_bot, alpha)
if the blending and addition all happen in the same color space. The top equation is how "blended" geometry works in a forward renderer (albedo and emissive light summed in shader before being blended into the framebuffer) and the bottom equation is how a deferred renderer works (blending done per layer and the light addition done later on the finished blend).

Once we add and blend in sRGB space in our deferred renderer a bunch of things do work 'right':
  • Alpha textures that need sRGB blending but can't be excluded from the deferred renderer work correctly, as do alpha fades with distance.
  • We can mix & match our emissive and albedo channels the way we would in a forward renderer and not be surprised.
  • Additive light from spills is still linear, since it is a separate accumulation into the HDR framebuffer.
There is one casualty:
  • We cannot draw linear additive-blended "stuff" into the deferred renderer.
This last point is a sad casualty - looking at the 11-step train-wreck of MRT changes in the previous post, if we could draw linear-blended stuff into the deferred renderer (even if just by using framebuffer-sRGB) we could save a lot of extra steps.  But we would lose sRGB alpha blending for deferred drawing.
I will try to summarize this mess in another post.

Saturday, November 17, 2012

Deferred Weirdness: Collapsing Two Passes

X-Plane's deferred pipeline changed a lot in our 10.10 patch, into a form that I hope is final for the version run, because I don't want to have to retest it again.  We had to fix a few fundamental problems.

Our first problem was to collapse two drawing passes.  X-Plane needs more precision than the Z buffer provides.  Consider the case where you are in an airplane in high orbit, in your 3-d cockpit.  The controls are a lot less than 1 meter away, but the far end of the planet below you might be millions of meters away.  With the near and far clip planes so far apart (and the near clip plane so close) there's no way to avoid Z thrash.

X-Plane traditionally solves this with two-pass rendering.  Because an airplane cockpit is sealed, we can draw the entire outside world in one coordinate space, blast the depth buffer, and then draw the interior cockpit with reset near/far clip planes.  The depth fragments of the cockpit are thus farther than parts of the scenery (in raw hardware depth buffer units) but the depth clear ensures clean ordering.

(This technique breaks down if something from the outside world needs to appear in the cockpit - we do some funny dances as the pilot exits the airplane and walks off around the airport to transition, and you can create rendering errors if you know what to look for and have nothing better to do.)

The Dumb Way

So when we first put in deferred rendering, we did the quickest thing that came to mind: two full deferred rendering passes.

(Driver writers at NVidia: please stop bashing your heads on your desks - we're trying to help you sell GPUs!  :-)

Suffice it to say, two full deferred passes was a bit of a show-stopper; deferred renderers tend to be bandwidth bound, and by consuming twice as much of it as a normal, sane game, we were destined to have half the framerates of what our users expected.

No Z Tricks

Unfortunately, I didn't find a non-linear Z buffer approach I liked.  Logarithmic Z in the vertex shader clips incorrectly, and any Z re-encoding in the fragment shader bypasses early-Z optimizations.  X-Plane has some meshes with significant over-draw so losing early Z isn't much fun.

Particle + HDR = Fail

There was a second performance problem that tied into the issue of bandwidth: X-Plane's cloud system is heavy on over-draw and really taxes fill rate and ROPs, and in the initial pipeline it went down into an HDR surface, costing 2x the memory bandwidth.  So we needed a solution that would put particle systems into an 8-bit surface if possible.

One Last Chainsaw

One last chainsaw to throw into the mix as we try to juggle them: our engine supports a "post-deferred" pass where alpha, lighting effects, particles, and other G-buffer-unfriendly stuff can live; these effects are forward rendered on top of the fully resolved deferred rendering.  We have these effects both outside of the airplane and inside the airplane!

Frankenstein is Born

The resulting pipeline goes something like this:
  • We have a G-Buffer, HDR buffer, and LDR buffer all of the same size, all sharing a common Z buffer.  The G-Buffer stores depth in eye space in half-float meters, which means we can clear the depth buffer and not lose our G-Buffer resolve.
  • Our interior and exterior coordinate systems are exactly the same except for the near/far clip planes of the projection matrix.  In particular, both the interior and exterior drawing phases are the same in eye space and world space.
  1. We pre-fill our depth buffer with some simple parts of the cockpit, depth-only, with the depth range set to the near clip plane.  This is standard depth pre-fill for speed; because the particle systems in step 4 will be depth tested, this means we can pre-occlude a lot of cloud particles with our cockpit shell.
  2. We render the outside solid world to the G-Buffer.
  3. We draw the volumes of our volumetric heat blur effects, stencil only, to "remember" which pixels are actually exposed (because our depth buffer is going to get its ass kicked later).
  4. We draw the blended post-gbuffer outside world into the LDR buffer, using correct alpha to get an "overlay" ready for later use.  (To do this, set the alpha blending to src_alpha,1-src_alpha,1,1-src_alpha.)  This drawing phase has to be early to get correct Z testing against the outside world, and has the side effect of getting our outside-world particles into an LDR surface for performance.
  5. We draw our light billboards to our HDR buffer.
  6. We clear the depth buffer and draw the inside-cockpit solid world over the G-Buffer.  We set stenciling to mark another bit ("inside the cockpit") in this pass.
  7. We draw the heat markers again, using depth-fail to erase the stencil set, thus 'clipping out' heat blur around the solid interior.  This gets us a heat blur stencil mask for later that is correct for both depth buffers.  (Essentially we have used two stenciling paths to 'combine' two depth tests on two depth buffers that were never available at the same time.)
  8.  We go back to our HDR buffer and blit a big black quad where the stencil marks us as "in-cockpit".  This masks out exterior light billboards from step 5 that should have been over-drawn by the solid cockpit (that went into the G-Buffer).  This could be done better with MRT, but would add a lot of complexity to already-complex configurable shaders.
  9. We "mix down" our G-Buffer to our HDR buffer.  Since this is additive, light billboards add up the way we want, in linear space.
  10. We draw another stenciled black quad on our LDR buffer to mask out the particles from step 4.
  11. Finally, we render in-cockpit particles and lights directly into the LDR buffer.
Yeah.  I went through a lot of Scotch this last patch.

A few observations on the beast:
  • That's a lot of MRT changes, which is by far the weakest aspect of the design.  We don't ever have to multi-pass over any surface except the depth buffer, but we're still jumping around a lot.
  • The actual number of pixels filled is pretty tame.
  • Night lighting is really sensitive to color space, and we picked up a few steps by insisting that we be in exactly the right color space at all times. Often the difference between a good and bad looking light is in the 0-5 range of 8-bit RGB values!  When lights are rendered to a layer and that layer is blended, we have to be blending in linear color space both when we draw our lights and when we composite the layer later!
In particular, there's one really weird bit of fine print: while spill lights accumulate in our HDR buffer linearly (which is a requirement for deferred lighting), pretty much every other blending equation in the deferred engine runs in sRGB space.  That's weird enough that it still surprises me, it makes everything way more complicated than it has to be, and I will describe why we need sRGB blending in the next post.

Friday, November 16, 2012

Deferred Lighting: Stenciling is not a Win

I've been meaning to write up a summary of the changes I made to X-Plane's deferred rendering pipeline for X-Plane 10.10, but each time I go to write up an epic mega-post, I lose steam and end up with another half-written draft, with no clue about what I meant to say.  So in the next few posts I'll try to cover the issues a little bit at a time.

One other note from the pipeline work we did: using the stencil buffer to reject screen space for deferred lights is not a win in X-Plane.

The technique is documented quite a bit in the deferred rendering powerpoints and PDFs.  Basically when drawing your deferred lights you:
  1. Use real 3-d volumetric shapes like pyramids and cubes to bound the lights.
  2. Use two-sided stenciling to mark only the area where there is content within the light volume.  A second pass over the volume then fills only this area.
The stenciling logic is exactly the same as stencil-shadow volumes, and the result is that only pixels that are within the light volume are lit; screen space both in front and behind the volume are both rejected.

For X-Plane, it's not worth it.  YMMV, but in the case of X-Plane, the cost of (1) using a lot more vertices per light volume and (2) doing two passes over the light volume far outweigh the saved screen space.

For a few very pathological cases, stenciling is a win, but I really found myself having to put the camera in ridiculous places with ridiculously lopsided rendering settings to see a stencil win, even on an older GPU. (I have a Radeon 4870 in my Mac - and if it's not a win there, it's not a win on a GeForce 680. :-)

The cost of volumes is even worse for dynamic lights - our car headlights all cast spill and the light volume transform is per-frame on the CPU.  Again, increasing vertex count isn't worth it.

For 10.10 we turned off the stencil optimization, cutting the vertex throughput of lights from two passes to one.

For a future version we will probably switch from volumes to screen-space quads, for a nice big vertex-count win.

Finally, I have looked at using instancing to push light volumes/quads for dynamic objects.  In the case of our cars, we have a relatively small set of cars whose lights are transformed a large number of times.  We could cut eight vertices (two quads per car) down to a single 3x4 affine transform matrix.

Again, YMMV; X-Plane is a very geometry-heavy title with relatively stupid shaders.  If there's one lesson, it's this: it is a huge win to keep instrumentation code in place.  In our case, we had the option to toggle stenciling and view performance (and the effect on our stat counters) at any time.

Wednesday, October 31, 2012


I just finished integrating FXAA 3.11 into X-Plane 10; we were using an older FXAA implementation.  Timothy Lottes pointed me at the right way to combine OGSSAA and FXAA: run the FXAA sampler in SSAA space and mix down its results on the fly.  Conveniently our pipeline was pretty much ready to do this.

I've been looking for an option for OGSSAA other than 4x.  One idea would be to use 1.4x1.4 scaling for 2x total fill rate costs, blurry box filtering be damned, but FXAA really needs to know where the individual pixels are.

X-Plane's main problem is temporal anti-aliasing, and most of the aliasing is vertical - that is, there are a lot of long thin horizontal features in the background (roofs of buildings, roads, etc.) that are responsible for the most annoying artifacts.

So I tried an experiment: non-square OGSSAA with FXAA.  It pretty much works.  I'm sure someone has done this before, and frankly I don't have the greatest eye for anti-aliasing, but the extra vertical res really improves image stability.

Secret decoder ring to the images: images with "FX" in the name (or use_post_aa=2) in the caption have FXAA applied.  The 2x/4x/8x applies to the OGSSAA sample count; the grid is shown in the pixels.  2x OGSSAA is 1x2, 4x is 2x2, and 8x is 2x4.

The pictures really don't do justice to the improvement that 2x4 gives the image in terms of temporal stability.  Having 4x the vertical samples for those thin roofs makes a big difference.

Tuesday, July 24, 2012

Deferred Depth: 3 Ways

I have been updating X-Plane 10's deferred pipeline; when 10.10 is in a later beta and we're sure that the new pipeline is going to hold, I'll write it up in more detail.  But for now, a few notes on deferred depth.

Our art guys need more properties in the G-Buffer, so I went looking to see if I could recycle the depth buffer, rather than waste a G-Buffer channel on depth.

The Problem

The problem is this: a deferred renderer typically wants to read depth to reconstruct eye-space position in the lighting pass; position is needed for lighting attenuation and shadows.

But the lighting pass almost certainly also needs the depth buffer to be bound for depth rejection.  There are a number of cute G-Buffer tricks that require this:
  • Any kind of depth-based light rejection (whether with a full stencil volume or just based on the bounding volume being occluded) requires the depth buffer of the scene as a real Z buffer.
  • Soft particles require both rejecting against Z and sampling it (to soften).  We really want the hardware Z buffer to cut fill rate!

Copying the Depth Buffer

One simple approach is to copy the depth buffer to a depth texture.  In my case, I tried copying the current depth buffer (which is D24/S8) to a GL_DEPTH_COMPONENT24 texture using glCopyTexSubImage2D.  This worked well and didn't cause performance problems; I guess that after a number of years, ripping the depth buffer to a texture has finally been ironed out.

With this technique the eye space layer of the G-buffer is another texture, but it comes from a single full-screen copy rather than from an additional MRT color attachment.

Read And Use

A second approach is to simply bind the depth buffer and use it at the same time.  This scheme requires GL_NV_texture_barrier (an extension I didn't know about until smarter people clued me in recently) and thus is only available on Windows.  In this scheme you:
  • Set up your D24/S8 depth buffer as a texture attached to the depth attachment of your main G-Buffer FBO, rather than a render-buffer.  (Non-power-of-two textures are a given on DX10 hardware.)
  • Share this depth buffer with your HDR texture that you "accumulate" light into.
  • After completing the g-buffer fill pass, call glTextureBarrierNV() to ensure that all write-outs to the depth buffer have completed before the next thing happens.
  • Turn off depth writing and Z-test and read from the depth buffer at the same time (something that is allowed by the relaxed semantics of the extension).
This saves us the copy and extra VRAM, but assumes that our various post-processing effects don't need to write Z, an assumption that is usually true.

I have not tried this technique; see below for why sharing Z isn't for X-Plane.

Eye-Space Float Z

One simple way to solve the problem (the one X-Plane originally used, and one that is sometimes used on older platforms that won't let you read and sample Z at the same time) is to simply write eye-space Z to part of the G-Buffer.  This wastes G-Buffer space, but... is unfortunately necessary for X-Plane.  X-Plane draws in two depth domains, one for the world and one for the 3-d cockpit.  Thus no one Z buffer contains full position information for the entire screen.  In order to avoid two full mix-downs of the G-Buffer, we simply write out eye-space depth in floating point, which gives us adequate precision over the entire world.*

If I could find a depth encoding that clips properly, doesn't inhibit early Z, and can span the entire depth range we need, we could use one of the techniques above, but I don't think such a Z technique exists, as X-Plane needs a near clip plane of around 1-5 cm in the cockpit and at least 100k meters to the far clip plane.  With D24S8 we're off by quite a few bits.

(I have not had a chance to experiment with keeping the stencil buffer separate yet.)

*Currently the code uses 16-bit float eye space depth, which is apparently faster to fill than 32F on ATI hardware according to some presentation I found.  Is it enough precision?  I am not sure because I will have to fix other shadow bugs first.  But it should be noted that we care a lot about near precision but really not much about far depth, which is only used for fog.  If a later post says we use a 32F eye-space Z and not 16F, you'll know what happened.

Wednesday, April 04, 2012

Beyond glMapBuffer

For a while X-Plane has had a performance problem pushing streaming vertices through ATI Radeon HD GPUs on Windows (but not OS X).  Our initial investigation showed that glMapBuffer was taking noticeable amounts of time, which is not the case on other platforms.  This post explains what we found and the work-around that we are using.

(Huge thanks to ATI for their help in giving me a number of clues!  What we did in hindsight seems obvious, but with 536 OpenGL extensions to choose from we might never have found the right solution.)

What Is Streaming?

To be clear on terminology, when I say streaming, I mean a vertex stream going to the GPU that more-or-less doesn't repeat, and is generated by the CPU per frame.  We have a few cases of these in X-Plane: rain drops on the windshield, car headlights (since the cars move per frame, we have to respecify the billboards every frame) and the cloud index buffers all fit into this category.  (In the case of the clouds, the Z sort is per frame, since it is affected by camera orientation; the puff locations are static.)

In all of these cases, we use an "orphan-and-map" strategy: for each frame we first do a glBufferData with a NULL ptr, which effectively invalidates the contents of the buffer at our current time stamp in the command stream; we then do a glMapBuffer with the write flag.  The result of this is to tell the driver that we want memory now and we don't care what's in it - we're going to respecify it anyway.

Most drivers will, in response to this, prepare a new buffer if the current one is still in-flight.  The effect is something like a FIFO, with the depth decided by the driver.  We like this because we don't actually know how deep the FIFO should be - it depends on how far behind the GPU is and how many GPUs are in our SLI/CrossFire cluster.

Why Would Streaming Be Expensive?

If orphaning goes well, it is non-blocking for the CPU - we keep getting new fresh buffers from the driver and can thus get as far ahead of the GPU as the API will let us.  So it's easy to forget that what's going on under the hood has the potential to be rather expensive.*  Possible sources of expense:
  • If we really need a new buffer, that is a memory allocation, by definition not cheap (at least for some values of the word "cheap").
  • If the memory is new, it may need a VM operation to set its caching mode to uncached/write-combined.
  • If the driver doesn't want to just spit out new memory all of the time it has to check the command stream to see if the old memory is unlocked.
  • If the driver wants to be light on address space use, it may have unmapped the buffer, so now we have a VM operation to map the buffer into our address space.  (From what I can tell, Windows OpenGL drivers are often quite aggressive about staying out of the address space, which is great for 32-bit games.  By comparison, address space use with the same rendering load appears to always be higher on Mac and Linux.  64 bit, here we come!)
  • If orphaning is happening, at some point the driver has to go 'garbage collect' the orphaned buffers that are now floating around the system.
Stepping back, I think we can say this: using orphaning is asking the driver to implement a dynamic FIFO for you.  That's not cheap.

Real Performance Numbers

I went to look at the performance of our clouds with the following test conditions:
  • 100k particles, which means 400k vertices or 800k of index data (we use 16-bit indices).
  • 50 batches, so about 2000 particles per VBO/draw call.
  • The rest of the scenery system was set quite light to isolate clouds, and I shrunk the particle size to almost nothing to remove fill rate from the calculation.
(In this particular case, we need to break the particles into batches for both Z sorting and culling reasons, and the VBOs aren't shared in a segment-buffer-like scheme due to the scrolling of scenery.)
Under these conditions, on my i5-2500 I saw these numbers.  The 0 ms really means < 1 ms, as my timer output is only good +/- 1 ms.  (Yes, that sucks...the error is in the UI of the timer, not the timer itself.)
  • NV GTX 580, 296.xx drivers: 2 ms to sort, 0 ms for map-and-write, 0 ms for draw.
  • ATI Radeon 7970 12-3 drivers: 2 ms to sort, 6 ms to map, 1 ms to write, 1 ms to plot.
That's a pretty huge difference in performance!  The map+write and draw is basically free on NV hardware, but costs 8 ms on ATI hardware.

glMapBufferRange Takes a Lock

In my original definition of streaming I describe the traditional glBufferData(NULL) + glMapBuffer approach.  But glMapBufferRange provides more explicit semantics - you can pass the GL_MAP_INVALIDATE_BUFFER_BIT flag to request a discard-and-map in a single call, with no separate glBufferData(NULL).  What surprised me is that on ATI hardware this performed significantly worse!
It turns out that as of this writing, on ATI hardware you also have to pass GL_MAP_UNSYNCHRONIZED_BIT or the map call will block waiting on pending draw calls.  The more backed up your GPU is, the worse the wait; while the 6 ms of map time above is enough to care a lot, blocking on a buffer can cut your framerate in half.
I believe that this behavior is unnecessarily conservative - since I don't see buffer corruption with unsynchronized + invalidation, I have to assume that the driver is mapping new memory, and if that is the case, it should always return immediately to improve throughput.  But I am not a driver writer and there's probably fine print I don't understand.  There's certainly no harm in passing the unsynchronized bit.

With invalidate + unsynchronized, map-buffer-range has the same performance as bufferdata(NULL)+map-buffer.  Without the unsynchronized bit, map-buffer-range is really slow.

Map-Free Streaming with GL_AMD_pinned_memory

Since glMapBufferRange doesn't do any better when using it for orphaning, I tried a different path: GL_AMD_pinned_memory, an extension I didn't even know existed until recently.

The pinned memory extension converts a chunk of your memory into an AGP-style VBO - that is, a VBO that is physically resident in system memory, locked down, write-combined, and mapped through the GART so the GPU can see it.  (In some ways this is a lot like the old-school VAR extensions, except that the mapping is tied to a VBO so you can have several in flight at once.)  The short of it:
  • Your memory becomes the VBO storage.  Your memory can't be deallocated for the life of the VBO.
  • The driver locks down your memory and sets it to be uncached/write-combined.  I found that if my memory was not page-aligned, I got some really spectacular failures, including but not limited to crashing the display driver.  I am at peace with this - better to crash hard and not lose fps when the code is debugged!
  • Because your memory is the VBO, there is no need to map it - you know its base address and you can just write to it.
This has a few implications compared to a traditional orphan & map approach:
  1. In return for saving map time, we are losing any address space management.  The pinned buffer will sit in physical memory and virtual address space forever.  So this is good for streaming amounts of data, but on a large scale might be unusable.  For "used a lot, mapped occasionally" this would be worse than mapping.  (But then, in that case, map performance is probably not important.)
  2. Because we never map, there is no synchronization.  We need to be sure that we are not rewriting the buffer for the next frame while the previous one is in flight.  This is the same semantics as using map-unsynchronized.
On this second point, my current implementation does the cheap and stupid thing: it allocates 4 segments of the pinned buffer, allowing us to queue up to 4 frames of data; theoretically we should be using a sync object to block in the case of "overrun".  The problem here is that we really never want to hit that sync - doing so indicates we don't have enough ring buffer.  But we don't want to allocate enough FIFO depth to cover the worst CrossFire or SLI case (where the outstanding number of frames can double or more) and eat that cost for all users.  Probably the best thing to do would be to fence and then add another buffer to the FIFO every time we fail the fence until we discover the real depth of the renderer.
With pinned memory, mapping is free, the draw costs about 1 ms and the sort costs us 2 ms; a savings of about 6-7 ms!

Copying the Buffer

The other half of this is that we use glCopyBufferSubData to "blit" our pinned buffer into a static-draw, VRAM-based VBO for the actual draw.  Technically we never know where any VBO lives, but with pinned memory we can assume the source is an AGP-style buffer; this means we eat bus bandwidth every time we draw directly from it.

glCopyBufferSubData is an "in-band" copy, meaning that the copy happens when the GPU gets to it, not immediately upon making the API call.  This means it won't block even if a previous draw call that uses the destination buffer hasn't completed.

In practice, for our clouds, I saw better performance without the copy - that is, drawing vertices from AGP was quicker than copying to VRAM.  This isn't super-surprising, as the geometry gets used only twice, and it is used in a very fill-rate expensive way (drawing alpha particles).**  We lost about 0.5 ms by copying to VRAM.

Sanity Checks

Having improved cloud performance, I then went to look at our streaming light billboard and streaming spill volume code and found that it was mistuned; the batch size was still set for an older build that had far fewer lights.  Now that our artists have had time to go nuts on the lighting engine, we were doing 5000 maps per second due to poor bucketing.

For that matter, the total amount of data being pushed in the stream was also really huge.  If there's a moral to this part of the story it is: sometimes the best way to make a slow API fast is to not use it.

Better Than Map-Discard

Last night I read this Nvidia presentation from GDC 2012, and it surprised me a little; this whole exercise had been about avoiding map-discard on ATI hardware for performance reasons - on NVidia hardware the driver was plenty fast.  But one of the main ideas of the paper is that you can do better than map-discard by creating your own larger ring buffer and using a sub-window.  For OpenGL, I believe you'd use the unsynchronized, invalidate-range, and write flags, and map each successive window as you fill it.

The win here is that the driver doesn't actually have to manage more than one buffer; it can do optimal things like leave the buffer mapped for a while, or return scratch memory and DMA it into the buffer later.  This idiom is still a map/unmap, though, so if the driver doesn't have a special optimization to make map fast, it wouldn't be a win.

(That is, on ATI hardware I suspect that ring-buffering your pinned VBO is better than using this technique.  But I must admit that I did not try implementing a ring buffer with map-discard-range.)

The big advantage of using an unsynchronized (well, synchronized only by the app) ring buffer is that you can allocate arbitrary size batches into it.  We haven't moved to that model in X-Plane because most of our streaming cases come in large and predictable quantities.

* In all of these posts, I am not a driver writer and have no idea what the driver is really doing.  But on the one or two times I have seen real live production OpenGL driver code, I have been shocked by how much crud the driver has to get through to do what seems like a cheap API call.  It's almost like the GL API is a really high level of abstraction!  Anyway, the point of this speculation is to put into perspective why, when confronted with a slow API call, the right response might be to stop making the call, rather than to shout all over the internet that the driver writers are lazy.  They're not.

** When drawing from AGP memory, the rate of the vertex shader's advancing through the draw call will be limited by the slowest of how fast it can pull vertex data over the AGP bus and how fast it can push finished triangles into the setup engine.  It is reasonable to expect that for large numbers of fill-heavy particles, the vertex shader is going to block waiting for output space while the shading side is slow.  Is the bus idle at this point?  That depends on whether the driver is clever enough to schedule some other copy operation at the same time, but I suspect that often that bus bandwidth is ours to use.

Monday, February 27, 2012

Confessions of a Lisp Hater

I hate to admit this, but sometimes I find myself missing python. For example, the other day, I wrote this horrible mess^H^H^H^H^H^Hbeautiful example of templated C++:
CollectionVisitor<Pmwx,Face_handle,UnlinkedFace_p> face_collector(&friends, UnlinkedFace_p(&io_faces));
VisitAdjacentFaces<Pmwx, CollectionVisitor<Pmwx,Face_handle,UnlinkedFace_p> >(*f, face_collector);
Once you start writing really well-factored C++ template functions, eventually you'll reach this point: the verbosity of code is dominated by having to write a ton of function objects because C++ doesn't have closures. (Yes, closures. No one is more pained to admit envy of a Lisp feature than me.)

What makes the above code particularly ugly isn't just, um, that mess, but the other shoe - the UnlinkedFace_p predicate is several lines of boiler plate for about one expression of useful C++:
struct UnlinkedFace_p {
    set<Face_handle> * universe_;
    UnlinkedFace_p(set<Face_handle> * universe) : universe_(universe) { }
    bool operator()(Face_handle who) const {
        return universe_->count(who) > 0;
    }
};
(Note: I am trying to get the less than and greater than symbols manually converted so Blogger doesn't delete 3/4 of my templated code like it has in the past.)

What makes this function so verbose is that the "closure" - that is, the act of copying the locally scoped data "into" the predicate object has to be done by hand for every predicate we write, and the predicate can't be located in the function where it really belongs. (At least, if I try to locate my struct inline, GCC becomes quite cross. There may be fine print to get me closer to predicates inline with the code.)

In Python this would be a lot more pleasant. There's no typing, so strip out all of that type-declaring goo, and we have real closures, so the entire function chain can be written as one nested set of calls that contain pretty much only the good bits.

This got me thinking about the difference between coding in C and Python. Here comes another stupid coding quote:
Coding in C is like going out to dinner with someone who's really cheap and insists on discussing the price of everything on the menu. "You know, that pointer dereference includes a memory access - 12 to 200 cycles. That ? involves conditional code, you might mis-predict the branch. That function is virtual - you're going to have two dependent memory fetches." You know the cost of everything on the menu. 
Coding in Python is like going on a shopping spree with your friend's credit card. "Go ahead, iterate through a list of lists, and insert in the middle. If it feels right, just do it. No, it's not expensive, why do you ask?"

Friday, February 17, 2012

Xcode, Lion, GCC Oh My!!

My curiosity always pushes me to upgrade to the latest OS despite Ben's cautions. I upgraded to Lion and quickly found that Xcode 3.2.6 cannot be installed on Lion. We're not ready to move completely to Xcode 4 just yet, so I had to find a workaround. Luckily, there is one.

Basically, you just have to follow these simple steps:
  • Mount the Xcode 3.2.6 DMG
  • Open Terminal
  • Enter the commands:
    open "/Volumes/Xcode and iOS SDK/Xcode and iOS SDK.mpkg"
Because I need to run Xcode 4 and Xcode 3.2.6 side by side, I installed Xcode 3.2.6 to /Developer-3.2.6 and let Xcode 4 have the default /Developer directory.

OK, that gets me a working copy of Xcode 3.2.6 as well as the SDKs that I need, which are 10.5 and 10.6...but I'm not out of the woods yet, because the X-Plane Scenery Tools require the 10.5 SDK. I have it, so there's no problem, right? WRONG! The GCC being used by the system is in /Developer, and if you take a look at /Developer/SDKs you'll see 10.6 and 10.7 but no 10.5. That's because 10.5 lives in /Developer-3.2.6, but GCC won't be looking there. My solution was to create a symbolic link (the Unix ln -s kind, not the Mac Alias kind) in the /Developer/SDKs folder that points to the 10.5 SDK in /Developer-3.2.6/SDKs. The actual command was:

ln -s /Developer-3.2.6/SDKs/MacOSX10.5.sdk /Developer/SDKs/MacOSX10.5.sdk

With that done, GCC now has access to the right SDKs. The Finder view of /Developer/SDKs should look like this:

At this point, I tried to compile the libs necessary for the X-Plane Scenery Tools again, but was hit with one more snag: the build was failing to find bits/c++config.h, as referenced by iostream and a dozen other headers. It turns out that the new GCC is a darwin11 build of the compiler. That's fine, but if you look at /Developer/SDKs/MacOSX10.5.sdk/usr/include/c++/4.2.1/ you'll notice that there are several folders named for the target the compiler was built for...the ones we care about are:
  • i686-apple-darwin*
  • x86_64-apple-darwin*
Here's a screenshot of what it looks like on my system:

You'll notice that there's no version for darwin11, which is why the headers cannot be found. The solution, once again, is a symbolic link. Note that darwin8 and darwin10 are actually just symlinks to darwin9. All I had to do was create symbolic links back to the darwin9 versions for the 32-bit and 64-bit paths (run from inside the 4.2.1 directory above). Here are the commands:

ln -s i686-apple-darwin9 i686-apple-darwin11
ln -s x86_64-apple-darwin9 x86_64-apple-darwin11

After these steps, the universe seemed to be back in order. Xcode 4 and Xcode 3.2.6 co-exist properly, and the scenery tools compile using the newer GCC, even against the older 10.5 SDK.