Tuesday, December 23, 2008

Compressed Vectors - Part 2

My last post introduced an STL vector that uses RLE to save space and an index for reasonably fast random access.

But that's not the whole story.  We're trying to compress the "snapshot structure".  Those familiar with the lowest level of X-Plane multiplayer networking know that X-Plane uses a packed structure to describe one "frame" of the sim - that is, plane location, control surface deflections, needles, etc.  It is very heavily packed (both with and without loss, depending on the parameter).

The flight replay is essentially a giant vector of these structures.  So...how do we utilize our compressed vector (which only compresses runs of identical values)?

The answer is the plane_old_data compressed vector.  This vector relies on the fact that its items are plain old data (POD) and stores each struct across many vectors, each usually holding the 4-byte ints or 8-byte long longs that make up the struct.  Call each of these vectors, which covers a sequence of cross-sections of our struct, a "track".  (Those who have used Pro Tools will understand why I say this.)

Now we can use our compressed vector for each track.  The result is that consecutive values over time at the same offset in our struct get compressed.  If the struct is filled with parameters (some of which vary, some of which don't), we still get to compress the ones that don't.
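Here is a rough sketch of what I mean, with made-up names (this is not the actual X-Plane class).  The compressed_vector stand-in below should be replaced with the run-indexed vector from the previous post.

#include <cstddef>
#include <cstring>
#include <vector>

// Stand-in for the run-indexed compressed vector from part 1; drop in the
// real class to get the actual compression.
template <typename T>
class compressed_vector : public std::vector<T> {};

template <typename POD>
class plane_old_data_vector {
    // One track per 32-bit word of the struct.  This sketch assumes that
    // sizeof(POD) is a multiple of 4; a real version would handle the tail.
    static const std::size_t num_tracks = sizeof(POD) / sizeof(unsigned int);
    compressed_vector<unsigned int> tracks[num_tracks];
public:
    void push_back(const POD& snapshot)
    {
        // Reinterpret the struct as an array of 32-bit words...
        unsigned int words[num_tracks];
        std::memcpy(words, &snapshot, sizeof(POD));
        // ...and append each word to its own track.  A parameter that never
        // changes produces one long run in its track.
        for (std::size_t t = 0; t < num_tracks; ++t)
            tracks[t].push_back(words[t]);
    }

    POD operator[](std::size_t frame) const
    {
        // Reassemble one snapshot by pulling the same frame from every track.
        unsigned int words[num_tracks];
        for (std::size_t t = 0; t < num_tracks; ++t)
            words[t] = tracks[t][frame];
        POD out;
        std::memcpy(&out, words, sizeof(POD));
        return out;
    }
};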

Once we get this structure integrated, we'll see what kind of packing efficiency we get.

Monday, December 22, 2008

Compressed Vectors - Part 1

Normally for data that has a lot of repeating values, I use run-length encoding - it works particularly well when the data is accessed from start to end in forward order.  For example, the spot lighting masks for X-Plane's 2-d panels are RGBA images that are RLE-compressed.  Continuous runs of transparent window become a single RLE line segment.  This actually makes processing faster, because the decision to skip transparent areas is made once per scan line, not once per pixel.

Things get a little bit trickier if we need random access to the data.  The structure I am playing with now provides something close to that.  It is a vector where runs of identical values are stored only once.  It has two parts:
  • A run index (stored as a sorted vector of key-value pairs) maps the beginning of a run (in the virtual data) to the start of storage (in a real data vector).
  • A real data vector contains each value as few times as possible.
We don't need to store "how long is this run" in the data vector - the index gives it to us.  We just look at the next index key-value pair and see (1) how long is our run and (2) how much data is behind it.  If only one value backs our 5-value run, we are run-length compressed.
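Here is a minimal sketch of the idea, with made-up names (not the actual X-Plane class):

#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

template <typename T>
class compressed_vector {
    typedef std::pair<std::size_t, std::size_t> entry;  // (virtual start, storage start)
    std::vector<entry> index;  // the run index, sorted by virtual start
    std::vector<T>     data;   // each value stored as few times as possible
    std::size_t        count;  // logical (uncompressed) size
public:
    compressed_vector() : count(0) {}

    std::size_t size() const { return count; }

    void push_back(const T& v)
    {
        if (count == 0) {
            index.push_back(entry(0, 0));
            data.push_back(v);
            count = 1;
            return;
        }
        std::size_t run_len  = count - index.back().first;         // logical length of the last run
        std::size_t data_len = data.size() - index.back().second;  // storage behind it
        bool        rle      = (data_len == 1 && run_len > 1);

        if (v == data.back() && data_len == 1) {
            ++count;                    // repeated value: the RLE run just grows
        } else if (v != data.back() && !rle) {
            data.push_back(v);          // new value in a literal run: store it
            ++count;
        } else {
            // Either a new value after an RLE run, or a repeat inside a
            // literal run: start a fresh run (at the cost of one duplicated
            // value in the second case) so future repeats can compress.
            index.push_back(entry(count, data.size()));
            data.push_back(v);
            ++count;
        }
    }

    T operator[](std::size_t i) const
    {
        // Binary-search the run index for the run containing i: O(log(R)).
        typename std::vector<entry>::const_iterator it = std::upper_bound(
            index.begin(), index.end(), entry(i, ~std::size_t(0)));
        --it;
        std::size_t data_start = it->second;
        std::size_t data_end   = (it + 1 == index.end()) ? data.size() : (it + 1)->second;
        if (data_end - data_start == 1)
            return data[data_start];                // one value backs the whole run
        return data[data_start + (i - it->first)];  // literal run
    }
};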

Complexity properties: call N the number of items, U the number of unique items, and R the number of runs (that is, the number of times we change between RLE and non-RLE data).
  • In the worst case (totally non-sequential data), the storage is the same as vector: O(N).  In the best case (all the same) we are O(1).  In both cases, R=1.
  • Generally our storage is O(R+U), which should be less than O(N).
  • Random access is O(log(R)) - not quite constant time, but pretty fast.  For the worst case random data, R is 1, so we can be fast that way.
  • Pushing data onto the end is constant time plus any cost of memory reallocation; since the vector is compressed, hopefully that is less.
  • Deleting data from the front is O(R+U) instead of O(N), which is a bit of an improvement.
At this point the unit tests work, but I don't have stats on real-world compression.

Sunday, December 21, 2008

Skinning MediaWiki

We use MediaWiki for the X-Plane Wiki, which is meant to be a collection of online documentation for tech support, third party editing, etc.

After pinging some users on my blog, the consensus seemed to be that while some people liked the interactivity of a Wiki, the usability of the site wasn't as good as "normal" doc websites (e.g. ones hand-made with HTML, PHP, and some CSS).  I can see why they said that - the stock MonoBook skin dedicates a lot of navigation space to "wiki-like" activities (editing and maintaining) and very little to search and navigation.

After looking at both MediaWiki and the PHP I had rolled for the scenery site, it became clear that it would be easier to make MediaWiki pretty than to make my PHP interactive.  So I built a new skin.  This is what I learned (the hard way):
  • In hindsight, using MonoBook as a starting point was really, really stupid.  My goal was a very different layout from MonoBook's, so I probably could have saved time by starting with a simpler skin.
  • On the other hand, MonoBook's php does output layout elements in a relatively clear way, so if you need to move things around on screen by changing the document hierarchy (as opposed to using a ton of CSS) it might be easier to start with MonoBook and move sections around.
  • Probably the smartest thing I could have done would have been to delete the whole style sheet and define every DIV block as having a border, padding and margin, so I could visualize the document layout.  What I did instead (slowly beating the style sheet into submission) took a lot longer.
Here are the old and new layouts.  The main ideas were:
  • Eliminate the side bar.  Important side bar items like search and the wiki name go up top; less important side bar items go to the bottom.
  • Use less styling - MonoBook has a ton of horizontal rules and borders all over the place. X-Plane Wiki content is often complex enough that readers don't need eye distractions. Most of the page is dedicated to document content.
  • Category breadcrumbs are enabled, and moved to the top, with much smaller style-sheet elements.  This provides "hierarchical" navigation (assuming the Wiki's category system is maintained).
The next step will be to redo the main portal pages (particularly main_page, which was never properly built) to give users a clear idea of where they need to go.

Monday, December 15, 2008

Transform Normals Directly

EDIT: Technique 2 (origin differencing) is, besides being a terrible idea, not the fastest way to transform normals. See this post for a much better treatment of the subject!

There are two ways to transform a normal vector N given a matrix M:
  1. Compute the inverse of M, transpose it, and use that new M' to transform the normal vector.
  2. Transform N by M directly, also transform the origin (0,0,0) by M, then subtract the transformed origin from the transformed normal. This is like transforming the two end points of the normal.
Which is better? Well, I'd say it depends.
  • If you are transforming a lot of normals, calculate the inverse and transpose it once. Now you can transform normals directly.
  • If you are transforming only one normal, it might be cheaper to transform two points rather than invert a 4x4 matrix.
But...there is a danger to the "transform the points" technique: loss of precision!

If your matrix contains a large translation, then the result of transforming the normal and the origin will be two points that are close together but far from the origin. This means that you have lost some precision, and subtracting won't get it back!

The inverse/transpose method does not have this problem; the transpose moves the translation part of the matrix off into the bottom row where we don't really care about it - the remaining matrix terms should have fairly small magnitudes and not lose precision.
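For completeness, here is technique 1 in code: compute the inverse-transpose once (with whatever matrix-inversion routine your math library provides - not shown here), then push every normal through its upper 3x3.  The layout is column-major (OpenGL style); the types are just illustrative.

#include <cmath>

struct vec3 { float x, y, z; };

// normal_matrix = transpose(inverse(M)), computed once per batch of normals.
vec3 transform_normal(const float normal_matrix[16], const vec3& n)
{
    vec3 r;
    r.x = normal_matrix[0]*n.x + normal_matrix[4]*n.y + normal_matrix[8]*n.z;
    r.y = normal_matrix[1]*n.x + normal_matrix[5]*n.y + normal_matrix[9]*n.z;
    r.z = normal_matrix[2]*n.x + normal_matrix[6]*n.y + normal_matrix[10]*n.z;
    // Re-normalize in case M contains scaling.
    float len = std::sqrt(r.x*r.x + r.y*r.y + r.z*r.z);
    if (len > 0.0f) { r.x /= len; r.y /= len; r.z /= len; }
    return r;
}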

Saturday, December 13, 2008

Fun With Übershaders

An Übershader (üshader) might mean different things to different developers.  A üshader always implies a single unified shader with a lot of capabilities, but how this is implemented could fall into two camps:
  1. Flow control in the shader.  This is good for an application that has poor OpenGL state sorting, because you can now draw everything with one shader and let conditionals sort it out.  But ... conditionals are not very fast (and in some hardware, not very existent), which leads to the other option...
  2. Create smaller specialized shaders by pre-processing.  This essentially is a loop-unrolling of option 1 - if an app has good state sorting, it makes sense to produce a specialized small shader for each case.
X-Plane uses option 2 - we support a lot of first-generation shader hardware (no conditionals) and we were able to leverage existing state-sorting code (designed to avoid texture binds).  For the rest of this discussion, "üshader" means implementation two: the mechanical production of many specialized small shaders from one larger piece of code.
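For the curious, here is a rough sketch of how option 2 can work - the feature flags and names are made up for illustration, not X-Plane's actual shader interface.  The über source guards each capability with preprocessor conditionals, and the loader prepends #defines for the requested feature set:

#include <map>
#include <string>

enum {
    feat_cloud_shadows = 1 << 0,
    feat_two_texture   = 1 << 1,
    feat_fog           = 1 << 2
};

static const char * k_uber_fragment_source =
    "#if FEAT_TWO_TEXTURE\n"
    "uniform sampler2D tex2;\n"
    "#endif\n"
    "/* ...the rest of the ubershader... */\n";

std::string specialize(unsigned int features)
{
    std::string src;
    src += (features & feat_cloud_shadows) ? "#define FEAT_CLOUD_SHADOWS 1\n"
                                           : "#define FEAT_CLOUD_SHADOWS 0\n";
    src += (features & feat_two_texture)   ? "#define FEAT_TWO_TEXTURE 1\n"
                                           : "#define FEAT_TWO_TEXTURE 0\n";
    src += (features & feat_fog)           ? "#define FEAT_FOG 1\n"
                                           : "#define FEAT_FOG 0\n";
    src += k_uber_fragment_source;
    return src;
}

// Specialized programs are built lazily and cached per feature combination,
// so only the combinations actually requested ever get compiled.
std::map<unsigned int, unsigned int> g_program_cache;  // feature bits -> GL program name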

In our case, the üshader accomplishes a number of nice things:
  • We wrap it in an abstraction layer, so that all graphics hardware looks "the same" to X-Plane - we even have a fixed function implementation of the üshader interface.  (This means: we have to live with whatever simplifications the fixed-function implementation makes - it is very hard for client code to know how its drawing request is "dumbed down". But we can live with this.)
  • The abstraction layer contains the OpenGL setup code for the shader, which keeps the huge pile of uniform setups and binds in one place.  One problem with GLSL development is that you end up with two separate chunks of code (the GL setup and the GLSL shader) that must not get out of sync.  The üshader keeps the GLSL in one place, and the abstraction wrapper keeps the GL setup in one place.
  • The abstraction layer presents a very orthogonal view of features, which helps us build a rendering engine that "does the right thing", e.g. if basic terrain can have cloud shadows, then two-texture terrain, runway terrain, road surface terrain, and every other type can also have cloud shadows.  Given the ability to plug user-authored content into the sim, we can't predict what combinations we might need to support.
  • The abstraction layer can build the shaders lazily, which keeps the number of shaders we keep around down to sane limits.  This is slightly frightening in that we don't (normally) generate every possible shader - conceivably there could be GLSL compile errors in certain odd combinations.
GLSL Optimization

We must consider the question: does this technique produce good GLSL shaders?  Well, it clearly produces better shaders than simply building a non-specialized üshader, which would contain a lot of extra crud.  The big concern for a üshader would be that the different parts of the shader, when glued together, would not produce optimal code.

So far I have found this not to be a huge problem.  GLSL compilers have come a long way - it appears to me that the days of having to hand-roll GLSL to optimize are over.  A year or two ago, I would be combining expressions to improve code performance; now I use temporaries and let the compiler figure it out.  That's enough compiler smarts to glue shader parts together cleanly, as long as the parts really are necessary.

There are some cases the compiler can't optimize, which cause the conditionalized code to become overly complex.  For example, if my shader outputs a varying and the next shader down the line doesn't use it, that is only discovered at link time, so I don't expect to get any dead-code elimination.

There is one way to produce substantially better shader code, and that is to recognize when parts of the shader features are not necessary.  For example, when we are doing a depth-only shadow render, we don't need to do any texture reads if the texture doesn't have an alpha channel.

One way to achieve this is to use a "feature optimizer" - that is, a specific piece of code that examines the naive requested shader state, and converts it to the true minimum required state (e.g. turning off texturing for non-alpha textures during depth-only shading).  Given clients of the shader code all over the app, having one optimizer as part of the shader interface means consistent application of optimization all the time.
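Here is a rough sketch of such an optimizer - the field names are illustrative, not our real interface:

// Reduce the naively requested shader state to the minimum actually needed
// for the current pass.
struct shader_request {
    bool depth_only;         // shadow-map / depth pre-pass render
    bool textured;
    bool texture_has_alpha;
    bool cloud_shadows;
};

shader_request optimize(shader_request req)
{
    if (req.depth_only) {
        // The texture read only matters for depth if alpha can discard fragments.
        if (!req.texture_has_alpha)
            req.textured = false;
        // Lighting-style features contribute nothing to a depth-only pass.
        req.cloud_shadows = false;
    }
    return req;
}

Because every client funnels its request through optimize() before asking for a shader, the reductions are applied consistently across the whole app.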

GL Optimization

Perhaps more important than GLSL optimization is GL state optimization - making sure that we change as little as possible between batches for fastest processing.  On the fastest machines, users need to increase "object count" (replicated objects whose meshes are resident in VRAM) to really utilize their GPU's power.  At that point, state change becomes the critical path, even on a very fast machine.

Unfortunately, it can be difficult to tell what set of state transitions the renderer will have to make - it depends a lot on culling, as well as user-generated content.

One way to improve state change is to keep local caches of commonly changed state to avoid calls into the driver.
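For example, a tiny (hypothetical) cache around texture binds might look like this - real code would track per-texture-unit state, handle deleted textures, and so on:

#include <GL/gl.h>   // <OpenGL/gl.h> on the Mac

static GLuint g_bound_tex_2d = 0;

inline void bind_texture_2d_cached(GLuint tex)
{
    // Skip the driver call entirely when the state isn't actually changing.
    if (tex != g_bound_tex_2d) {
        glBindTexture(GL_TEXTURE_2D, tex);
        g_bound_tex_2d = tex;
    }
}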

A more aggressive optimization is to provide "specific transition" functions for the shader.  So as well as an API to "set up the shader with these settings", we have APIs that change specific, commonly changed, easy-to-recognize aspects of the shader state, ideally via an optimized path.  For example: often in an airplane we change just one texture (but no other state), and we know we are only changing the texture.  We can avoid a lot of thrash by recognizing this and using an optimized code path.

Clash of the Optimizers

Here's where things get ugly: can we still recognize a fast transition situation when there is an optimizer trying to alter GL state?  Naively, the answer is no.  For example, consider the case where we eliminate the texture read for no-alpha textures.  If we now want to do a fast "texture only" change, we can only do this optimally if both textures have the same alpha channel (or lack thereof) - and in fact, the optimal case is to do absolutely nothing if neither texture has alpha.

Thursday, December 11, 2008

Pinching Cascading Shadow Maps

The battle with cascading shadow maps (CSM) is always one for resolution.  We could use 25 layers of CSM, but that would defeat the whole purpose of CSM.  Ad-hoc shadow maps deliver really good, precise shadows, but with two drawbacks:
  • They are horribly dependent on the size of the objects in your world.  For small objects they produce crisp shadows - for big ones they produce muck.
  • Odds are the number of objects floating around (cars, buildings, etc.) is several orders of magnitude larger than the number of CSM layers you might use.  I get good results with 8 CSM layers, and can probably reduce that with careful optimization.  I usually have more than 8 buildings on screen.  (That means a lot of thrash as we create, then use each shadow map, with GL setup each time.)
Yesterday I found (by misunderstanding the NVidia White Paper) a way to squeeze a little bit more detail out of my CSM layers.

For each "layer" (that is, a distance-wise partition of the user's frustum that gets a separate shadow map) we normally calculate the shadow map's bounding cube around the corners of the user's view sub-frustum (for that layer).

But that's really a much bigger bounding box than we need.  For the price of an iteration over the scene graph* we can calculate a smaller bounding box that is "pinched in" to the edge of the content we need to draw.
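In code, the pinch is just the intersection of two light-space bounding boxes.  Here is a sketch with stand-in types (not X-Plane's actual scene-graph code), assuming the caller has already transformed the sub-frustum corners and the visible entities' bounding-box corners into the sun camera's space:

#include <algorithm>
#include <cfloat>
#include <cstddef>
#include <vector>

struct vec3 { float x, y, z; };

struct aabb {
    vec3 mn, mx;
    aabb() { mn.x = mn.y = mn.z = FLT_MAX; mx.x = mx.y = mx.z = -FLT_MAX; }
    void grow(const vec3& p) {
        mn.x = std::min(mn.x, p.x);  mx.x = std::max(mx.x, p.x);
        mn.y = std::min(mn.y, p.y);  mx.y = std::max(mx.y, p.y);
        mn.z = std::min(mn.z, p.z);  mx.z = std::max(mx.z, p.z);
    }
};

aabb pinch_layer_bounds(const std::vector<vec3>& frustum_corners,   // this layer's sub-frustum, light space
                        const std::vector<vec3>& content_corners)   // visible entities' box corners, light space
{
    aabb frustum_box, content_box;
    for (std::size_t i = 0; i < frustum_corners.size(); ++i)
        frustum_box.grow(frustum_corners[i]);
    for (std::size_t i = 0; i < content_corners.size(); ++i)
        content_box.grow(content_corners[i]);

    // The pinched bound is the intersection: never bigger than the classic
    // frustum-corner box, and never bigger than the content itself.  (An
    // inverted result means the layer is empty and can be skipped.)
    aabb pinched;
    pinched.mn.x = std::max(frustum_box.mn.x, content_box.mn.x);
    pinched.mn.y = std::max(frustum_box.mn.y, content_box.mn.y);
    pinched.mn.z = std::max(frustum_box.mn.z, content_box.mn.z);
    pinched.mx.x = std::min(frustum_box.mx.x, content_box.mx.x);
    pinched.mx.y = std::min(frustum_box.mx.y, content_box.mx.y);
    pinched.mx.z = std::min(frustum_box.mx.z, content_box.mx.z);
    return pinched;
}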

Is this a win?  Well, it depends...on...
  • The size of your scenery content.  You won't get a pinched-in box that's much smaller than the smallest indivisible entities in your world.
  • The overall shape of your scenery content - if it really fills the frustum, no win.
Evaluating this on scenery, I am seeing:
  • Very little benefit for the nearest layers...they are usually so small (a few dozen meters) that they intersect much larger scenery entities.  (Our ground patches can be several kilometers, so no win.)  But for the far layers, we might reduce the bounding box by 25-50% in some dimensions...that's almost like doubling your texture resolution!
  • The shape of our content is a win.  Since the world is sort of flat and 2-dimensional, usually at least one axis of the bounding box (depending on the sun's angle) is horribly wasted.  That's where we get a win.
Example: the user's camera is on the ground, facing north.  The sun is directly overhead.  Now the user's frustum indicates that, in the farther layers, content that is significantly below or above the ground would be visible.  (We don't have occlusion culling here.)

But in practice, there is no scenery below the ground.  So we can "pinch in" the far clip plane of the sun's camera (which is really far from the sun, just in case anything below the surface of the earth is visible), bringing that far clip plane all the way up to the lowest ground point. Similarly, if we're not shadowing clouds (they are handled separately) the near clip plane can be pushed down to the tallest building.

This makes the far layers much, much more useful.  Normally if the layer is 10 km away at a 60-degree FOV, the bounding box for the shadow map is going to span 10,000 meters from its near plane to its far plane.  If we "pinch in", we can reduce this "depth of field" to the difference between the lowest and highest scenery items, which might only be 100 or 200 meters.

(This is of course a best-case scenario...put a mountain in there and angle the sun a bit and the win is much more modest.)

As a side effect, the scene graph traversal lets us completely eliminate layers that contain no content - I am finding I drop at least one layer that way.

EDIT: all of the above data is totally misleading - for shadowing 3-d content, the above is true. But if the terrain mesh is included (it is much larger, and its granularity is larger), the savings all vanish.

* Lazy evaluation can make this a lot faster - simply skip whole sub-trees of the scene graph that are already entirely inside the "pinched" cube.

Wednesday, December 10, 2008

Oblique Frustum Culling is Cool

Every now and then I come across a graphics algorithm so out there that it just blows my mind - it feels like cheating the laws of physics or something.  Oblique Frustum Culling is one of those algorithms.

Here's the situation: to make a pretty water reflection, you need to reflect the camera position around the water plane - take a picture, and then use that projected texture later to draw the water reflections.  There's one hitch: you have to clip out everything below the water.  If you don't, you get this.

The inset picture is the "from-the-water" view...note that the pillars of the bridge go way below the water (so one art asset can be used at many elevations).  But since they are not obscured by the water (when viewed from below the water), they show up in the reflection texture as the dark areas behind the bridges.  (This is because the reflection texture's projection doesn't know what "forward" and "backward" are along the projected light ray.)

You can't fix this using depth buffering or by drawing the water - the water would occlude everything you do want.  So you need a user-specified clip plane.  Clip everything below the water and go home happy, like this.

As I have blogged before, clip planes and shaders don't always work well together.  (And by "don't always", I mean never.)  If I remember correctly, one vendor requires that you write the clip vertex while the other requires that you don't.  To make it more fun, the latest GLSL specs deprecate the clip vertex entirely in favor of gl_ClipDistance (which makes more sense, actually).  Mix that with drivers for OS X and for Win/Lin, and users who aren't on the latest GLSL compilers, and you have enough anarchy to want to kill yourself.

That's where oblique frustum culling comes in.  Basically it's an algorithm that takes the near clip plane of the view frustum and distorts/bends/twists/performs voodoo on it so that it is your user clip plane.  Quite literally everything below water is now "too close" to the camera.
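The version I've seen published (Eric Lengyel's) boils down to rewriting the third row of the projection matrix; here's a sketch for a column-major OpenGL projection matrix, with the clip plane given in eye space and the camera on its negative side.  (Treat this as a sketch of the published math, not a drop-in for any particular engine.)

static float sgn(float a)
{
    if (a > 0.0f) return  1.0f;
    if (a < 0.0f) return -1.0f;
    return 0.0f;
}

// matrix: column-major projection matrix (e.g. from glGetFloatv).
// clip_plane: (a,b,c,d) in eye space, camera on the negative side.
void modify_projection_for_clip_plane(float matrix[16], const float clip_plane[4])
{
    // The clip-space corner point opposite the clipping plane, pulled back
    // into eye space through the inverse projection.
    float q[4];
    q[0] = (sgn(clip_plane[0]) + matrix[8]) / matrix[0];
    q[1] = (sgn(clip_plane[1]) + matrix[9]) / matrix[5];
    q[2] = -1.0f;
    q[3] = (1.0f + matrix[10]) / matrix[14];

    // Scale the plane so it maps onto the z = -w boundary of clip space.
    float dot = clip_plane[0]*q[0] + clip_plane[1]*q[1] +
                clip_plane[2]*q[2] + clip_plane[3]*q[3];
    float scale = 2.0f / dot;

    // Replace the third (z) row of the projection matrix.
    matrix[2]  = clip_plane[0] * scale;
    matrix[6]  = clip_plane[1] * scale;
    matrix[10] = clip_plane[2] * scale + 1.0f;
    matrix[14] = clip_plane[3] * scale;
}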

What I really love about this algorithm is: no extra state - we don't have to have conditionals in our shaders for the (rare) time we need our user clip plane - we just bend the frustum and move on with life.  How cool is that?!!

Tuesday, December 09, 2008

iPhone + PVRTriStrip + Degenerate Tris = Um...

Hopefully I am not splatting any NDAs in posting this, but...if you are using PVRTriStrip.cpp (part of the SDK for Imagination Technologies' GPUs) be sure not to pass "degenerate" triangles to their strip builder.  A degenerate triangle would be one with only two distinct vertices, e.g. indices like 4 4 5.

As far as I can tell, this case causes the tri-strip finder to sometimes lose valid triangles and instead output incorrect sequences.  I don't understand exactly what's going on, but when I filtered out degenerate triangles, the problem went away.
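The filter itself is trivial - something along these lines (illustrative, not the actual code):

#include <cstddef>
#include <vector>

typedef unsigned short index_t;   // or whatever index type your mesh uses

// indices holds 3 entries per triangle; drop any triangle that repeats an index.
void filter_degenerate_tris(std::vector<index_t>& indices)
{
    std::vector<index_t> out;
    out.reserve(indices.size());
    for (std::size_t i = 0; i + 2 < indices.size(); i += 3) {
        index_t a = indices[i], b = indices[i + 1], c = indices[i + 2];
        if (a != b && b != c && a != c) {
            out.push_back(a);
            out.push_back(b);
            out.push_back(c);
        }
        // else: degenerate - skip it rather than feed it to the stripper
    }
    indices.swap(out);
}

Run this over your index buffer before calling into PVRTriStrip.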

Having your index buffers in strip order is very important for fast geometry throughput!

Saturday, December 06, 2008

Dividends and Free Software

People like me (that is, people who are paid to write computer programs, not people who compulsively rant on the internet) should probably be considering the meaning of free software. By free I don't mean as in freedom, but rather free as in beer and freedom. While crazi^H^H^H^H^H visionaries like Richard Stallman emphasize the difference, we cannot deny that one of the impacts of having free(dom) software is to drive the price of that software down. You can buy Linux (or rather buy a distribution, maybe with support), but if you would like Linux for free, that's a very, very viable solution. (For my three installations of Linux, I have paid a total of $0.)

For the sake of argument, let's call this phenomenon "cheap" software - that is, as a user, your total outlay of cash for software might be a lot less than it used to be, as (1) features you used to have to buy become integrated into components you own (e.g. compression/decompression is now free with every major operating system) and (2) software that you might have had to buy has become free in itself (e.g. the operating system, your word processor).

Why is there free software now (when it wasn't such a major part of the software ecosystem 20 years ago)? I think there are perhaps three reasons:
  1. Communication technology has improved, making it possible to aggregate programmer effort easily. (In other words, we can all work on code over the Internet.)
  2. Dividends from faster hardware have been invested in programmer productivity. (In other words, we get more done because we're not all coding in assembly language.)
  3. We have free time.
I think there is no question that having the price of software decline is a good thing. It would be absurd to argue that improvements in farming are bad because it means we have fewer farmers making most of the food - we can all recognize that without this, no one could take jobs making cars and houses and everything else that makes us, as a society, wealthy. In fact it's this very wealth effect (that is, increases in productivity mean we have to work less) that makes free software possible.

What got me thinking about this lately is my friend pointing out that Microsoft is offering people cash to use Windows Live Search. I see this as a sign that Microsoft is really, really screwed. They are a huge company, but their glory days are over - they will never again control a monopoly like the one they had on the desktop. Pundits have argued that they need to transition into a large, stable, diversified company, but I don't think it's going to happen, for two reasons:
  1. Computer technologies tend to be winner-take-all. If Microsoft can't take all of a new hot area, they won't be the winner.
  2. The price of software, their core business, slowly and perpetually moves to $0.
If Search is the next big thing, it's really too late for them to try to claw their way back, even with cash. The winner (Google) is already taking all.

Remember the good old days when a PC cost over $1000 and it was a good business to make PCs? What happened? We hit a point where consumers didn't want more, they wanted cheaper. Rather than put the hottest new chips in the computer, manufacturers had to compete on price, and now you can get a PC for $400. It's not good to be in that market any more.

The operating system has gone the same way. Users don't want any more operating system. Now that every major operating system has protected virtual memory, tolerable parallel performance, a journaled file system and a GUI on top, there just aren't new interesting features. I don't know what will be new in OS X 10.6 and (at the risk of blasphemy) I don't really care...there's just nothing more Apple can add to the OS that is interesting to the vast majority of its customers. Whatever is in the next Windows, the only thing they can do right is clean up the disaster that is Vista, and that hardly makes a hit. (You shouldn't have to release a broken version of your software just to create the opportunity to improve it in the next version.)

In this environment, what users will want is not new features; what they will want is something that works and costs as little as possible. Where could I get an operating system like that?

It will be interesting to see how the battle for embedded platforms plays out. The differences between the desktop and embedded markets are:
  • Open source is already here while the market develops - no head start for commercial software companies.
  • The margins are very, very low - price matters so much more.
  • There is no legacy of X11/Motif - free software isn't starting with a ball and chain tied to its ankle.
I'm not betting on Microsoft.