Tuesday, December 23, 2008

Compressed Vectors - Part 2

My last post introduced an STL vector that uses RLE to save space and an index for reasonably fast random access.

But that's not the whole story.  We're trying to compress the "snapshot structure".  Those familiar with the lowest level of X-Plane multiplayer networking know that X-Plane uses a packed structure to describe one "frame" of the sim - that is, plane location, control surface deflections, needles, etc.  It is very heavily packed (both with and without loss, depending on the parameter).

The flight replay is essentially a giant vector of these structures.  So...how to utilize our compressed vector (which only compresses identical data)?

The answer is the plane_old_data compressed vector.  Basically this vector relies on the fact that its items are plain old data (POD) and stores each struct across many vectors (each usually holding 4-byte ints or 8-byte long longs).  Call each of these vectors, which covers a sequence of cross sections of our struct, a "track".  (Those who have used ProTools will understand why I say this.)

Now we can use our compressed vector for each track.  The result is that consecutive values over time at the same offset in our struct get compressed.  If the struct is filled with parameters (some of which vary, some of which don't), we get to compress the ones that don't change.
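To make the idea concrete, here is a minimal sketch of the track layout.  The struct, its field layout, and the compressed_vector stand-in are all hypothetical, not the actual snapshot structure:

    #include <cstdint>
    #include <vector>

    struct snapshot {                   // one packed "frame" of the sim (simplified)
        std::int32_t words[16];         // pretend the struct is 16 4-byte words
    };

    template <class T>
    struct compressed_vector {          // stand-in for the RLE vector from Part 1
        void push_back(const T& v) { data.push_back(v); }
        std::vector<T> data;
    };

    struct snapshot_track_store {
        // One track per 4-byte offset.  Consecutive frames land in the same
        // track, so a parameter that never changes collapses into one run.
        compressed_vector<std::int32_t> tracks[16];

        void push_back(const snapshot& s) {
            for (int i = 0; i < 16; ++i)
                tracks[i].push_back(s.words[i]);
        }
    };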

Once we get this structure integrated, we'll see what kind of packing efficiency we get.

Monday, December 22, 2008

Compressed Vectors - Part 1

Normally for data that has a lot of repeating values, I use run-length encoding - it works particularly well when the data is accessed from start to end in forward order.  For example, the spot lighting masks for X-Plane's 2-d panels are RGBA images that are RLE-compressed. Contiguous scan lines of transparent window become a single RLE line segment.  This actually makes processing faster, because the decision to skip transparent areas is made once per scan line, not once per pixel.

Things get a little bit trickier if we need random access to the data.  The structure I am playing with now provides something similar to that.  It is a vector where consecutive identical values are stored once as runs.  It has two parts:
  • A run index (stored as a sorted vector of key-value pairs) maps the beginning of a run (in the virtual data) to the start of storage (in a real data vector).
  • A real data vector contains each value as few times as possible.
We don't need to store "how long is this run" in the data vector - the index gives it to us.  We just look at the next index key-value pair and see (1) how long is our run and (2) how much data is behind it.  If only one value backs our 5-value run, we are run-length compressed.
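Here is a sketch of the random-access lookup, with hypothetical member names - the real class has more to it (push_back, erase, iteration), but this shows how the index drives everything:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    template <class T>
    class rle_vector {
    public:
        // Random access: O(log R) binary search on the run index.
        const T& operator[](std::size_t n) const {
            // Find the last run that starts at or before virtual position n.
            auto it = std::upper_bound(index_.begin(), index_.end(),
                                       std::make_pair(n, SIZE_MAX));
            --it;
            std::size_t run_start  = it->first;    // where the run begins (virtual)
            std::size_t data_start = it->second;   // where its storage begins (real)
            std::size_t data_end   = (it + 1 != index_.end()) ? (it + 1)->second
                                                              : data_.size();
            // One backing value means the run is RLE-compressed; otherwise the
            // values are stored one-for-one and we offset into the run.
            return (data_end - data_start == 1)
                 ? data_[data_start]
                 : data_[data_start + (n - run_start)];
        }
    private:
        std::vector<std::pair<std::size_t, std::size_t> > index_;  // run start -> data start
        std::vector<T>                                     data_;  // each value stored as few times as possible
    };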

Complexity properties: call N the number of items, U the number of unique items, and R the number of runs (that is, the number of times we change from RLE to non-RLE data).
  • In the worst case (totally non-sequential data), the storage is the same as vector: O(N).  In the best case (all the same) we are O(1).  In both cases, R=1.
  • Generally our storage is O(R+U), which should be less than O(N).
  • Random access is O(log(R)) - not quite constant time, but pretty fast.  For the worst case random data, R is 1, so we can be fast that way.
  • Pushing data onto the end is constant time plus any cost of memory reallocation; since the vector is compressed, hopefully that is less.
  • Deleting data from the front is O(R+U) instead of O(N), which is a bit of an improvement.
At this point the unit tests work, but I don't have stats on real-world compression.

Sunday, December 21, 2008

Skinning MediaWiki

We use MediaWiki for the X-Plane Wiki, which is meant to be a collection of online documentation for tech support, third party editing, etc.

After pinging some users on my blog, the consensus seemed to be that while some people liked the interactivity of a Wiki, the usability of the site wasn't as good as "normal" doc websites (e.g. made with HTML, php, and some CSS).  I can see why they said that - the regular MediaWiki MonoBook skin has a lot of navigation space dedicated to "wiki-like" activities (e.g. editing and maintaining) and very little dedicated to search and navigation.

After looking at my own hand-rolled php for the scenery site and at MediaWiki, it became clear that it would be easier to make MediaWiki pretty than to make my php interactive.  So I built a new skin.  This is what I learned (the hard way):
  • In hindsight, using Monobook as a starting point was really, really stupid.  My goal was a very different layout from MonoBook, so I probably could have saved time by starting with a simpler skin.
  • On the other hand, MonoBook's php does output layout elements in a relatively clear way, so if you need to move things around on screen by changing the document hierarchy (as opposed to using a ton of CSS) it might be easier to start with MonoBook and move sections around.
  • Probably the smartest thing I could have done would have been to delete the whole style sheet and define every DIV block as having a border, padding and margin, so I could visualize the document layout.  What I did instead (slowly beat the style sheet into submission) took a lot longer.
Here are the old and new layouts.  The main ideas were:
  • Eliminate the side bar.  Important side bar items like search and the wiki name go up top; less important side bar items go to the bottom.
  • Use less styling - monobook has a ton of horizontal rules and borders all over the place. X-Plane Wiki content is often complex enough that readers don't need eye distractions. Most of the page is dedicated to document content.
  • Category breadcrumbs are enabled, and moved to the top, with much smaller style-sheet elements.  This provides "hierarchical" navigation (assuming the Wiki's category system is maintained).
The next step will be to redo the main portal pages (particularly main_page, which was never properly built) to give users a clear idea of where they need to go.

Monday, December 15, 2008

Transform Normals Directly

EDIT: Technique 2 (origin differencing) is, besides being a terrible idea, not the fastest way to transform normals. See this post for a much better treatment of the subject!

There are two ways to transform a normal vector N given a matrix M:
  1. Compute the inverse of M, transpose it, and use that new M' to transform the normal vector.
  2. Transform N directly by M, and also transform the origin (0,0,0) by M, then subtract the transformed origin from the transformed normal. This is like transforming the two end points of the normal.
Which is better? Well, I'd say it depends.
  • If you are transforming a lot of normals, calculate the inverse and transpose it once. Now you can transform normals directly.
  • If you are transforming only one normal, it might be cheaper to transform two points rather than invert a 4x4 matrix.
But...there is a danger to the "transform the points" technique: loss of precision!

If your matrix contains a large translation, then the result of transforming the normal and origin will be two points that are close together, but far from the origin. This means that you have lost some precision, and subtracting won't get it back!

The inverse/transpose method does not have this problem; the transpose moves the translation part of the matrix off into the bottom row where we don't really care about it - the remaining matrix terms should have fairly small magnitudes and not lose precision.
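For reference, a sketch of the inverse/transpose path for the upper-left 3x3 (the only part a normal cares about).  The cofactor matrix divided by the determinant is exactly the inverse-transpose, so no separate transpose step is needed; the mat3 type is a placeholder:

    struct mat3 { float m[3][3]; };

    // Normal matrix = transpose(inverse(A)) = cofactor(A) / det(A).
    mat3 inverse_transpose(const mat3& a)
    {
        mat3 r;
        r.m[0][0] =   a.m[1][1]*a.m[2][2] - a.m[1][2]*a.m[2][1];
        r.m[0][1] = -(a.m[1][0]*a.m[2][2] - a.m[1][2]*a.m[2][0]);
        r.m[0][2] =   a.m[1][0]*a.m[2][1] - a.m[1][1]*a.m[2][0];
        r.m[1][0] = -(a.m[0][1]*a.m[2][2] - a.m[0][2]*a.m[2][1]);
        r.m[1][1] =   a.m[0][0]*a.m[2][2] - a.m[0][2]*a.m[2][0];
        r.m[1][2] = -(a.m[0][0]*a.m[2][1] - a.m[0][1]*a.m[2][0]);
        r.m[2][0] =   a.m[0][1]*a.m[1][2] - a.m[0][2]*a.m[1][1];
        r.m[2][1] = -(a.m[0][0]*a.m[1][2] - a.m[0][2]*a.m[1][0]);
        r.m[2][2] =   a.m[0][0]*a.m[1][1] - a.m[0][1]*a.m[1][0];
        // Determinant via the first row of cofactors.
        float det = a.m[0][0]*r.m[0][0] + a.m[0][1]*r.m[0][1] + a.m[0][2]*r.m[0][2];
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                r.m[i][j] /= det;
        return r;
    }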

Saturday, December 13, 2008

Fun With Übershaders

An Übershader (üshader) might mean different things to different developers.  A üshader always implies a single unified shader with a lot of capabilities, but how this is implemented could fall into two camps:
  1. Flow control in the shader.  This is good for an application that has poor OpenGL state sorting, because you can now draw everything with one shader and let conditionals sort it out.  But ... conditionals are not very fast (and in some hardware, not very existent), which leads to the other option...
  2. Create smaller specialized shaders by pre-processing.  This essentially is a loop-unrolling of option 1 - if an app has good state sorting, it makes sense to produce a specialized small shader for each case.
X-Plane uses option 2 - we support a lot of first generation shader hardware (no conditionals) and we were able to leverage existing state-sorting code (designed to avoid texture binds).  For the rest of this discussion, "üshader" means implementation two: the mechanical production of many specialized small shaders from one larger piece of code.
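A minimal sketch of what "mechanical production" can look like - prepend #defines to one master source and compile.  The flag names and master source are hypothetical, and real code would check the compile log:

    #include <string>
    #include <GL/glew.h>    // or whatever GL 2.0 header/loader you use

    GLuint compile_specialized(GLenum stage, const std::string& master_source,
                               bool has_cloud_shadows, bool has_two_tex)
    {
        std::string src = "#version 120\n";
        if (has_cloud_shadows) src += "#define HAS_CLOUD_SHADOWS 1\n";
        if (has_two_tex)       src += "#define HAS_TWO_TEXTURES 1\n";
        src += master_source;                   // the one big üshader body

        GLuint shader = glCreateShader(stage);
        const char* p = src.c_str();
        glShaderSource(shader, 1, &p, NULL);
        glCompileShader(shader);
        return shader;
    }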

In our case, the üshader accomplishes a number of nice things:
  • We wrap it in an abstraction layer, so that all graphics hardware looks "the same" to X-Plane - we even have a fixed function implementation of the üshader interface.  (This means: we have to live with whatever simplifications the fixed-function implementation makes - it is very hard for client code to know how its drawing request is "dumbed down". But we can live with this.)
  • The abstraction layer contains the OpenGL setup code for the shader, which keeps the huge pile of uniform setups and binds in one place.  One problem with GLSL development is that you end up with two separate chunks of code (the GL setup and the GLSL shader) that must not get out of sync.  The üshader keeps the GLSL in one place, and the abstraction wrapper keeps the GL setup in one place.
  • The abstraction layer presents a very orthogonal view of features, which helps us build a rendering engine that "does the right thing", e.g. if basic terrain can have cloud shadows, then two-texture terrain, runway terrain, road surface terrain, and every other type can also have cloud shadows.  Given the ability to plug user-authored content into the sim, we can't predict what combinations we might need to support.
  • The abstraction layer can build the shaders lazily, which keeps the number of shaders we keep around down to sane limits.  This is slightly frightening in that we don't (normally) generate every possible shader - conceivably there could be GLSL compile errors in certain odd combinations.
GLSL Optimization

We must consider the question: does this technique produce good GLSL shaders?  Well, it clearly produces better shaders than simply building a non-specialized üshader, which would contain a lot of extra crud.  The big concern for a üshader would be that the different parts of the shader, when glued together, would not produce optimal code.

So far I have found this to be not a huge problem.  GLSL compilers have come a long way - it appears to me that the days of having to hand-roll GLSL to optimize are over.  A year or two ago, I would be combining expressions to improve code performance; now I use temporaries and let the compiler figure it out.  That's enough compiler smarts to glue shader parts together as long as they really are necessary.

There are some cases that the compiler can't optimize that cause the conditionalized code to become overly complex.  For example, if my shader outputs a varying and the next shader down the line doesn't use it, this is discovered at link time, so I don't expect to get any dead-code elimination.

There is one way to produce substantially better shader code, and that is to recognize when parts of the shader features are not necessary.  For example, when we are doing a depth-only shadow render, we don't need to do any texture reads if the texture doesn't have an alpha channel.

One way to achieve this is to use a "feature optimizer" - that is, a specific piece of code that examines the naive requested shader state, and converts it to the true minimum required state (e.g. turning off texturing for non-alpha textures during depth-only shading).  Given clients of the shader code all over the app, having one optimizer as part of the shader interface means consistent application of optimization all the time.
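A sketch of what such an optimizer might look like - the state struct and its fields are made up for illustration, but the shape of the code is the point: one function, called by everyone, that collapses the naive request:

    struct shader_state {
        bool depth_only;            // e.g. rendering a shadow map
        bool texture_has_alpha;     // alpha can punch holes in the depth pass
        bool want_texture;          // client asked for texturing
        bool want_cloud_shadows;    // client asked for cloud shadowing
    };

    shader_state optimize_state(shader_state s)
    {
        if (s.depth_only) {
            // Depth-only pass: the texture read matters only if alpha does.
            if (!s.texture_has_alpha) s.want_texture = false;
            // Lighting-style features contribute nothing to depth.
            s.want_cloud_shadows = false;
        }
        return s;
    }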

GL Optimization

Perhaps more important than GLSL optimization is GL state optimization - making sure that we change as little as possible between batches for fastest processing.  On the fastest machines, users need to increase "object count" (replicated objects whose meshes are resident in VRAM) to really utilize their GPU's power.  At that point, state change becomes the critical path, even on a very fast machine.

Unfortunately, it can be difficult to tell what set of state transitions the renderer will have to make - it depends a lot on culling, as well as user-generated content.

One way to improve state change is to keep local caches of commonly changed state to avoid calls into the driver.
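For example, something as small as this (a sketch - a real cache covers far more than one texture target) saves a driver call whenever the same texture is requested twice in a row:

    #include <GL/glew.h>    // or whatever GL header/loader you use

    static GLuint s_current_tex_2d = 0;

    inline void bind_texture_2d_cached(GLuint tex)
    {
        if (tex != s_current_tex_2d) {          // only hit the driver on a real change
            glBindTexture(GL_TEXTURE_2D, tex);
            s_current_tex_2d = tex;
        }
    }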

A more aggressive optimization is to provide "specific transition" functions for the shader.  So as well as an API to "set up the shader with these settings", we have APIs that change specific commonly changed, easy to recognize small aspects of the shader, ideally using an optimized path.  For example: often in an airplane we change just one texture (but no other state), and we know we are only changing the texture.  We can avoid a lot of thrash by recognizing this and using an optimized code path.

Clash of the Optimizers

Here's where things get ugly: can we still recognize a fast transition situation when there is an optimizer trying to alter GL state?  Naively, the answer is no.  For example, consider the case where we eliminate texture read for no-alpha textures.  If we now want to do a fast "texture only" change, we can only do this optimally if both textures have the same alpha channel (or lack thereof) - and in fact, the optimal case is to do absolutely nothing if both textures do not have alpha.

Thursday, December 11, 2008

Pinching Cascading Shadow Maps

The battle with cascading shadow maps (CSM) is always one for resolution.  We could use 25 layers of CSM, but that would defeat the whole purpose of CSM.  Ad-Hoc shadow maps deliver really good, precise shadows, but with two draw-backs:
  • They are horribly dependent on the size of the objects in your world.  For small objects they produce crisp shadows - for big ones they produce muck.
  • Odds are the number of objects floating around (cars, buildings, etc.) is several orders of magnitude larger than the number of CSM layers you might use.  I get good results with 8 CSM layers, and can probably reduce that with careful optimization.  I usually have more than 8 buildings on screen.  (That means a lot of thrash as we create, then use each shadow map, with GL setup each time.)
Yesterday I found (by misunderstanding the NVidia White Paper) a way to squeeze a little bit more detail out of my CSM layers.

For each "layer" (that is, a distance-wise partition of the user's frustum that gets a separate shadow map) we normally calculate the shadow map's bounding cube around the corners of the user's view sub-frustum (for that layer).

But that's really a much bigger bounding box than we need.  For the price of an iteration over the scene graph* we can calculate a smaller bounding box that is "pinched in" to the edge of the content we need to draw.
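Roughly, the traversal looks like this - every type here (aabb3, scene_node, the two helpers) is a placeholder, and the real code also honors the lazy-evaluation shortcut mentioned in the footnote below:

    #include <cstddef>
    #include <vector>

    struct aabb3 {
        float lo[3], hi[3];
        void expand(const aabb3& o) {
            for (int i = 0; i < 3; ++i) {
                if (o.lo[i] < lo[i]) lo[i] = o.lo[i];
                if (o.hi[i] > hi[i]) hi[i] = o.hi[i];
            }
        }
    };

    struct scene_node {
        aabb3                    bounds_world;
        std::vector<scene_node*> children;
    };

    aabb3 to_light_space(const aabb3& world_box);       // assumed helper
    bool  intersects(const aabb3& a, const aabb3& b);   // assumed helper

    // Shrink-wrap the layer's light-space box around the content that will
    // actually be drawn into this shadow map.  'pinched' starts out empty.
    void pinch(const scene_node* n, const aabb3& layer_box, aabb3& pinched)
    {
        aabb3 lb = to_light_space(n->bounds_world);
        if (!intersects(lb, layer_box)) return;         // cull the whole sub-tree
        if (n->children.empty())
            pinched.expand(lb);                         // leaf content contributes
        for (std::size_t i = 0; i < n->children.size(); ++i)
            pinch(n->children[i], layer_box, pinched);
    }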

Is this a win?  Well, it depends...on...
  • The size of your scenery content.  You won't get a pinch-in that's much smaller than the smallest indivisible entities in your world.
  • The overall shape of your scenery content - if it really fills the frustum, no win.
Evaluating this on scenery, I am seeing:
  • Very little benefit for the nearest layers...they are usually so small (a few dozen meters) that they include much larger scenery entities.  (Our ground patches can be several kilometers, so no win.)  But for the far layers, we might reduce the bounding box by 25-50% in some dimensions...that's almost like doubling your texture!
  • The shape of our content is a win.  Since the world is sort of flat and 2-dimensional, usually at least one axis of the bounding box (depending on the sun's angle) is horribly wasted.  That's where we get a win.
Example: the user's camera is on the ground, facing north.  The sun is directly overhead.  Now the user's frustum indicates that, in the farther layers, content that is significantly below or above the ground would be visible.  (We don't have occlusion culling here.)

But in practice, there is no scenery below the ground.  So we can "pinch in" the far clip plane of the sun's camera (which is really far from the sun, just in case anything below the surface of the earth is visible), bringing that far clip plane all the way up to the lowest ground point. Similarly, if we're not shadowing clouds (they are handled separately) the near clip plane can be pushed down to the tallest building.

This makes the far layers much, much more useful.  Normally if the layer is 10 km away at 60 degrees FOV, the bounding box for the shadow map is going to have 10000 meters from its near to far plane.  If we "pinch in", we can reduce this "depth of field" to the difference between the lowest and highest scenery items, which might only be 100 or 200 meters.

(This is of course a best-case scenario...put a mountain in there and angle the sun a bit and the win is much more modest.)

As a side effect, the scene graph traversal lets us completely eliminate layers that contain no content - I am finding I drop at least one layer that way.

EDIT: all of the above data is totally misleading - for shadowing 3-d content, the above is true. But if the terrain mesh is included (it is much larger, and its granularity is larger), the savings all vanish.

* Lazy evaluation can make this a lot faster - simply skip whole sub-trees of the scene graph that are already entirely inside the "pinched" cube.

Wednesday, December 10, 2008

Oblique Frustum Culling is Cool

Every now and then I come across a graphics algorithm so out there that it just blows my mind - it feels like cheating the laws of physics or something.  Oblique Frustum Culling is one of those algorithms.

Here's the situation: to make a pretty water reflection, you need to reflect the camera position around the water plane - take a picture, and then use that projected texture later to draw the water reflections.  There's one hitch: you have to clip out everything below the water.  If you don't, you get this.

The inset picture is the "from-the-water" view...note that the pillars of the bridge go way below the water (so one art asset can be used at many elevations).  But since they are not obscured by the water (when viewed from below the water), they show up in the reflection texture as the dark areas behind the bridges.  (This is because the reflection texture's projection doesn't know what "forward" and "backward" are along the projected light ray.)

You can't fix this using depth buffering or by drawing the water - the water would occlude everything you do want.  So you need a user-specified clip plane.  Clip everything below the water and go home happy, like this.

As I have blogged before, clip planes and shaders don't always work well together.  (And by don't always, I mean never.)  If I remember correctly, one vendor requires you write the clip vertex while the other requires you don't.  To make it more fun, the latest GLSL specs will deprecate the clip vertex entirely in terms of gl_ClipDistance (which makes more sense actually).  Mix that with drivers for OS X and for Win/Lin, and users who aren't on the latest GLSL compilers, and you have enough anarchy to want to kill yourself.

That's where oblique frustum culling comes in.  Basically it's an algorithm that takes the near clip plane of the view frustum and distorts/bends/twists/performs voodoo on it so that it is your user clip plane.  Quite literally everything below water is now "too close" to the camera.
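The published formulation (this is essentially Eric Lengyel's, transcribed from memory, so treat it as a sketch) warps the third row of the projection matrix, given the clip plane in eye space with the camera on its negative side:

    static inline float sgn(float a) { return (a > 0.0f) ? 1.0f : (a < 0.0f) ? -1.0f : 0.0f; }

    // m is the column-major GL projection matrix; clip = (a,b,c,d) in eye space.
    void apply_oblique_near_plane(float m[16], const float clip[4])
    {
        // Corner point of the frustum opposite the clip plane, back in eye space.
        float qx = (sgn(clip[0]) + m[8])  / m[0];
        float qy = (sgn(clip[1]) + m[9])  / m[5];
        float qz = -1.0f;
        float qw = (1.0f + m[10]) / m[14];

        float s = 2.0f / (clip[0]*qx + clip[1]*qy + clip[2]*qz + clip[3]*qw);

        // Replace the third row so the near plane *is* the clip plane.
        m[2]  = clip[0] * s;
        m[6]  = clip[1] * s;
        m[10] = clip[2] * s + 1.0f;
        m[14] = clip[3] * s;
    }

Compute your normal projection matrix first, run it through this, and load the result; everything on the wrong side of the water simply fails the near-plane test.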

What I really love about this algorithm is: no extra state - we don't have to have conditionals in our shaders for the (rare) time we need our user clip plane - we just bend the frustum and move on with life.  How cool is that?!!

Tuesday, December 09, 2008

iPhone + PVRTriStrip + Degenerate Tris = Um...

Hopefully I am not splatting any NDAs in posting this, but...if you are using PVRTriStrip.cpp (part of the SDK for Imagination Technologies' GPUs) be sure not to pass "degenerate" triangles to their strip builder.  A degenerate triangle would be one with only two distinct vertices, e.g. indices like 4 4 5.

As far as I can tell, this case causes the tri-strip finder to sometimes lose valid triangles and instead output incorrect sequences.  I don't understand exactly what's going on, but when I filtered out degenerate triangles, the problem went away.
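The filter itself is trivial - something like this (assuming 16-bit indices in triangle-list order; adjust to whatever you actually feed the stripper):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    std::vector<std::uint16_t> drop_degenerate_tris(const std::vector<std::uint16_t>& idx)
    {
        std::vector<std::uint16_t> out;
        out.reserve(idx.size());
        for (std::size_t i = 0; i + 2 < idx.size(); i += 3) {
            std::uint16_t a = idx[i], b = idx[i + 1], c = idx[i + 2];
            if (a != b && b != c && a != c) {   // keep only real triangles
                out.push_back(a);
                out.push_back(b);
                out.push_back(c);
            }
        }
        return out;
    }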

Having your index buffers in strip order is very important for fast geometry throughput!

Saturday, December 06, 2008

Dividends and Free Software

People like me (that is, people who are paid to write computer programs, not people who compulsively rant on the internet) should probably be considering the meaning of free software. By free I don't mean as in freedom, but rather free as in beer and freedom. While crazi^H^H^H^H^H visionaries like Richard Stallman emphasize the difference, we cannot deny that one of the impacts of having free(dom) software is to drive the price of that software down. You can buy Linux (or rather buy a distribution, maybe with support), but if you would like Linux for free, that's a very, very viable solution. (For my three installations of Linux, I have paid a total of $0.)

For the sake of argument, let's call this phenomenon "cheap" software - that is, as a user, your total outlay of cash for software might be a lot less than it used to be as (1) features you used to have to buy become integrated into components you own (e.g. compression/decompression is now free with every major operating system) and (2) software that you might have to buy has become free in itself (e.g. the operating system, your word processor).

Why is there free software now (when it wasn't such a major part of the software ecosystem 20 years ago)? I think there are perhaps three reasons:
  1. Communication technology has improved, making it possible to aggregate programmer effort easily. (In other words, we can all work on code over the Internet.)
  2. Dividends from faster hardware have been invested in programmer productivity. (In other words, we get more done because we're not all coding in assembly language.)
  3. We have free time.
I think there is no question that having the price of software decline is a good thing. It would be absurd to argue that improvements in farming are bad because it means we have fewer farmers making most of the food - we can all recognize that without this, no one could take jobs making cars and houses and everything else that makes us, as a society, wealthy. In fact it's this very wealth effect (that is, increases in productivity mean we have to work less) that makes free software possible.

What got me thinking about this lately is my friend pointing out that Microsoft is offering people cash to use Windows Live Search. I see this as a sign that Microsoft is really, really screwed. They are a huge company, but their glory days are over - they will never again control a monopoly like the one they had on the desktop. Pundits have argued that they need to transition into a large, stable, diversified company, but I don't think it's going to happen for two reasons:
  1. Computer technologies tend to be winner-take-all. If Microsoft can't take all of a new hot area, they won't be the winner.
  2. The price of software, their core business, slowly and perpetually moves to $0.
If Search is the next big thing, it's really too late for them to try to claw their way back, even with cash. The winner (Google) is already taking all.

Remember the good old days when a PC cost over $1000 and it was a good business to make PCs? What happened? We hit a point where consumers didn't want more, they wanted cheaper. Rather than put the hottest new chips in the computer, manufacturers had to compete on price, and now you can get a PC for $400. It's not good to be in that market any more.

The operating system has gone the same way. Users don't want any more operating system. Now that every major operating system has protected virtual memory, tolerable parallel performance, a journaled file system and a GUI on top, there just aren't new interesting features. I don't know what will be new in OS X 10.6 and (at risk of blasphemy) I don't really care...there's just nothing more Apple can add to the OS that is interesting to the vast majority of its customers. Whatever is in the next Windows, the only thing they can do right is clean up the disaster that is Vista, and that hardly makes a hit. (You shouldn't have to release a broken version of your software just to create the opportunity to improve in the next version.)

In this environment, what users will want is not new features; what they will want is something that works and costs as little as possible. Where could I get an operating system like that?

It will be interesting to see how the battle for embedded platforms plays out. The differences between the desktop and embedded markets are:
  • Open source is already here while the market develops - no head start for commercial software companies.
  • The margins are very, very low - price matters so much more.
  • There is no legacy of X11/Motif - free software isn't starting with a ball and chain tied to its ankle.
I'm not betting on Microsoft.

Saturday, November 29, 2008

CGAL + GMP

If you use CGAL with GMP for numerics, you might want to compile a debug version of GMP. The problem is: GMP will abort if it gets bogus input (like a NaN floating point initializer).  But GMP is normally coded with stack back-links disabled for speed.  This means that if you have a NaN in your code, as soon as CGAL passes it to GMP you get a crash with no back-trace to see how it happened.

The fix is to recompile GMP in some kind of debug mode for development purposes.

Wednesday, November 26, 2008

CSM vs. Ad Hoc Shadows - Quality

Comparing the quality of ad-hoc shadows vs. CSM...there are some cases where ad-hoc does a lot better.  In particular, given a character or vehicle near the camera, ad-hoc usually looks better, because a lot of shadow map res is dedicated to a relatively small area.

Where CSM excels is in very large models or very large terrains where there is no good scene-graph based decomposition.  For example, ad-hoc shadows on terrain by decomposing sub-parts of the mesh works rather poorly - decompositions really need to be based around view frustums (which is exactly what CSM does), not based on world coordinates.

One thing I haven't been able to quantify yet is the cost in triangle count to CSM.  If you look carefully at the CSM scheme, you'll see that at certain camera vs. sun angles, there can be significant overlap between the shadow volumes, and that means multiple iterations over the scene graph content that is in the "shared" location.  If the model in that location is expensive, this can be a potential performance problem.  

(For example, if you work on, oh I don't know, a flight simulator, there is a chance that the user's airplane is significantly more expensive than anything else in the universe...if it spans several CSM volumes, you're going to feel the pain.)

Finally, from what I can tell, while it is more efficient (fill-rate wise) to apply all CSM volumes at once, it is not strictly necessary from a quality standpoint - I'm not seeing a ton of artifacting from applying CSM volumes separately via stenciling.  

(The artifacts would be from the overlap of low and higher res shadow maps...what I have found is that if the CSM scheme uses enough splits to really look good, the overlap regions are small and don't differ that much in quality between the lower and higher res shadow maps that overlap.)

nVidia's CSM demo uses four shadow maps -- my tests required six at first, but upon examination, it looks like the closest and farthest map are too close and too far to be useful, and could be dropped.

Tuesday, November 25, 2008

Ad-Hoc Stenciled Shadow Maps

Previously I blogged a design for combining G-Buffering with shadow mapping using the stencil buffer. I doubt that this is an original idea; the GPU Gems 2 chapter on G-Buffering (a la CRYSIS) mentions that G-Buffering and shadow mapping work well together.

I would describe the G-Buffering + stencil + shadow mapping approach as "ad-hoc" shadow mapping because the approach lets you compute any number of arbitrary shadow volumes and apply them to screen-space areas.  Because we are using the stencil buffer, it doesn't matter if the shadow volumes overlap.  We can simply pick out the most important parts of the scene (closest to camera, biggest, important to user, flagged by artist) and shadow those.  We can shadow fewer models or all models, determined by runtime settings.

Wait, what do I mean by "shadow volume"?  Well, a shadow map is a 2-d texture, but the value of a pixel in that 2-d texture is the "nearest occluder" from the sun.  Since the minimum and maximum gray-scale value correspond to distances from the sun, we can think of the shadow map effectively specifying occluder information within a cube that is aligned such that the sun sees exactly one square face of the cube.

Cascading Shadow Maps (CSM) uses a similar approach to fix some of the weaknesses of shadow mapping, namely:
  1. That shadow volume is very much resolution limited - both by texture dimension (X and Y axes) and texture precision (Z axis).  For very large scenes, you run out of res long before you get a nice looking shadow.
  2. Often the alignment of the sun and user's viewpoint are such that your pixels are being spent where the user can't see them, which is wasteful.
CSM solves both of these problems by using multiple shadow volumes in multiple maps, using a smaller volume for the near part of the view frustum (which is naturally smaller since the view frustum gets larger as it goes away from your eyeball).

Typical CSM designs will simply use 2, 3, or 4 shadow maps simultaneously, looking up the shadow per pixel in each one.  But this design can be adapted to ad-hoc shadow mapping -- with ad-hoc shadow mapping, we simply build each shadow map (near, far, very far away) in series, and apply each one to the stencil buffer.

Since there is no penalty for overlapping shadow volumes, we can even combine ad-hoc and CSM approaches - we can run several shadow volumes along the view frustum (a CSM-like approach), excluding "high value" content - and then separately shadow that "high value" content, each with its own shadow map.  The result is generally good shadow precision, and arbitrarily precise shadows for models that require extra res.

Wednesday, October 29, 2008

CGAL: Abusing merge_edge

The CGAL arrangement_2 operation "merge_edge" takes two connected half-edges (connected by a vertex of degree 2) and makes them one. Useful! But there is some fine print:

merge_edge is only meant to be used when the new combined curve covers the exact same path as the old curve. If you are using line segments as your curves, this means the three vertices involved (the middle vertex is the one to be deleted) are collinear.

Can merge_edge be abused to remove edges that bend? Yes, sort of, if you are very careful:
  1. You must verify that the new combined edge does not cross any other edges, and will be in the same face as the existing two edges. This limits how much you can move an edge by combining. Basically merge_edge can't change topology!
  2. The two half-edges being merged must go in the same direction! Otherwise, the half-edge direction flag could be wrong after the merge.
None of this is very safe - safer (but slower) is to simply add the new edge and delete the two old ones.
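For what it's worth, the checks look something like this.  Arrangement_2 is your instantiated segment-traits arrangement, and edges_would_cross_anything() is a stand-in for whatever topology test you can afford (it is not a CGAL call):

    typedef Arrangement_2::Halfedge_handle     Halfedge_handle;
    typedef Arrangement_2::X_monotone_curve_2  Curve_2;

    bool edges_would_cross_anything(Arrangement_2& arr, const Curve_2& c);  // assumed helper

    bool try_merge(Arrangement_2& arr, Halfedge_handle e1, Halfedge_handle e2)
    {
        // The half-edges must meet head-to-tail (same direction) at a degree-2 vertex.
        if (e1->target() != e2->source()) return false;
        if (e1->target()->degree() != 2)  return false;

        Curve_2 merged(e1->source()->point(), e2->target()->point());

        // Abusing merge_edge: only safe if the straightened edge stays in the
        // same face and crosses nothing - merge_edge cannot change topology.
        if (edges_would_cross_anything(arr, merged)) return false;

        arr.merge_edge(e1, e2, merged);
        return true;
    }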

Tuesday, October 28, 2008

STL Priority Queue

The STL on my Mac seems to have a "priority queue", but it's not quite what I want - it's just a wrapper around a heap.  What I want in a priority queue is:
  1. log-time insertion of elements.
  2. Constant time popping of the front.  (Actually, I can live with log-time here.)
  3. The ability to reprioritize a single item in log time, based on the item itself.
  4. Priorities stored with the elements (rather than calculated by a comparator).
  5. Ability to put items into the queue without adding special "helper" fields.
Those last two points aren't critical, but they are very convenient.

Item 3 is what makes things interesting...for a lot of priority queue algorithms, the algorithm is only fast if you can reprioritize just the few affected items without going over the entire queue.  (Doing an entire-queue operation will typically give you N-squared time, because you hit the entire queue to reprioritize every time you process a single item.)

There are two ways I've used to do this:
  1. If I store the priority in the item itself (violates the last two points), I can just insert the items into an STL multi-map by priority.  To reprioritize an item, I find it by doing an equal_range query on its current priority, erase it, then reinsert it at the new priority.
  2. To meet all the points, I use a map from priority to item, but then maintain a separate "back-link" map from items to iterators into the original map.  This gives me the ability to find an item by value and remove it from the priority queue and reinsert it.
The system can be made more efficient by using a hash map instead of a regular map for the back links, but it requires a hashing operator to be defined for the values, rather than just operator<.
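Here's a sketch of the plain-map version of approach 2 - names are illustrative, and Item just needs operator< so it can key the back-link map:

    #include <map>
    #include <utility>

    template <class Item, class Priority = double>
    class reprioritizable_queue {
    public:
        void insert(const Item& it, Priority p) {
            back_[it] = queue_.insert(std::make_pair(p, it));
        }
        void reprioritize(const Item& it, Priority p) {
            queue_.erase(back_[it]);    // find by value via the back-link
            insert(it, p);              // reinsert at the new priority
        }
        Item pop_front() {              // lowest priority comes out first
            typename queue_t::iterator q = queue_.begin();
            Item it = q->second;
            back_.erase(it);
            queue_.erase(q);
            return it;
        }
        bool empty() const { return queue_.empty(); }
    private:
        typedef std::multimap<Priority, Item>       queue_t;
        queue_t                                     queue_;
        std::map<Item, typename queue_t::iterator>  back_;
    };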

Friday, October 24, 2008

User Clip Planes and GLSL

To combine a GLSL shader and user clip planes, you need to do two things:
  1. Set up your clip plane normally using glClipPlane and glEnable(GL_CLIP_PLANEx).
  2. Make sure to write the eye-space transformed vertex position to gl_ClipVertex in your vertex shader.
You do not have to do the clipping yourself - you just need to output an eye space vertex from your vertex shader so that the GPU can do it for you.  (Remember, normally your shader outputs only clip-space coordinates, potentially skipping eye-space entirely.)
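Both halves together look roughly like this (GLSL 1.20-era built-ins; the plane values and shader body are placeholders):

    #include <GL/glew.h>    // or whatever GL header/loader you use

    static const char* k_clip_vertex_vs =
        "void main() {                                        \n"
        "    vec4 eye_pos   = gl_ModelViewMatrix * gl_Vertex; \n"
        "    gl_ClipVertex  = eye_pos;                        \n"   // feeds the fixed-function clipper
        "    gl_Position    = gl_ProjectionMatrix * eye_pos;  \n"
        "    gl_TexCoord[0] = gl_MultiTexCoord0;              \n"
        "}                                                    \n";

    void setup_water_clip_plane()
    {
        // Specified in object space; GL transforms it to eye space at this call.
        GLdouble water_plane[4] = { 0.0, 1.0, 0.0, 0.0 };   // placeholder plane
        glClipPlane(GL_CLIP_PLANE0, water_plane);
        glEnable(GL_CLIP_PLANE0);
    }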

And this should make sense - the shape of a clipped triangle may not be a triangle - it could be a quadrilateral which must then be decomposed into two triangles for the setup-engine.  So clipping needs to be done by dedicated hardware - your shader just has to give it access to eye-space coordinates.

(Clip planes are similar to eye-space tex-gen and lights in that you specify your input in object space, but the clip plane is then transformed to eye space at input time.  This means that the clip plane is stored "camera relative" and will not move as you move the model view matrix. This is a good thing - clip planes are typically global in nature, so we don't want them to move each time we move the model view matrix to draw a car or some other moving mesh.)

EDIT: of course, you'll fall back to software rendering if you touch gl_ClipVertex on most hardware.  Your options:
  • Use fixed function pipeline.
  • Use pre-writing the Z buffer if you can.
  • Case this out by card.

Thursday, October 23, 2008

CGAL Arrangements - How Do You Find Your Edges?

A fundamental client problem with the CGAL arrangement_2 class: how do you recover handles to the edges you insert?  Note that the insertion APIs do not return halfedge handles unless you use very specialized (read: limited) functions.

I have found basically two strategies:

If you can afford to iterate across every single half-edge in the arrangement, use Arr_consolidated_curve_data_traits_2 to tag your input curves with an index number back to the original source data.  

After the insert (and you can use any kind of bulk insert you want), simply iterate across all edges (remember, a single curve is shared between two halfedges), pull out your key, and you now have a link from half-edge to original curve.  This is a many-to-many relationship - the consolidated curve traits will make sure that multiple keys make it to a half-edge if needed.

If you are doing a small amount of incremental work, attach an observer.  You will need to override the following observer messages:
  • after_create_edge to detect new half-edges induced for your insertion.
  • after_modify_edge to detect cases where your edge completely overlaps an existing half-edge.
  • after_split_edge to detect the case where your edge partly overlaps an existing edge.  (The existing edge will be split so that there is a vertex at the end of your inserted curve.)
  • If the curves you are inserting intersect with each other, after_split_edge also needs to detect the case where an edge you previously inserted and care about is split in two.
I use the former strategy for the bulk creation of a map from source data - I'm going to have to apply traits to every edge, so I might as well let the bulk insertion run without a ton of individual callbacks, then go through later.

I use the latter strategy to detect induced rings when I am adding small areas and not using the sweep-line algorithm.
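A sketch of the observer itself (the handle bookkeeping is the interesting part; what you do with the handles afterwards is up to you).  Arrangement_2 is your instantiated arrangement type:

    #include <vector>
    #include <CGAL/Arr_observer.h>

    class edge_watcher : public CGAL::Arr_observer<Arrangement_2> {
    public:
        typedef Arrangement_2::Halfedge_handle Halfedge_handle;

        edge_watcher(Arrangement_2& arr) : CGAL::Arr_observer<Arrangement_2>(arr) {}

        std::vector<Halfedge_handle> touched;   // edges induced by our insertion

        virtual void after_create_edge(Halfedge_handle e) {
            touched.push_back(e);               // a brand-new half-edge pair
        }
        virtual void after_modify_edge(Halfedge_handle e) {
            touched.push_back(e);               // our curve completely overlapped e
        }
        virtual void after_split_edge(Halfedge_handle e1, Halfedge_handle e2) {
            touched.push_back(e1);              // partial overlap (or a later insert
            touched.push_back(e2);              // cutting an edge we care about)
        }
    };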

Future work: I have not had time to analyze whether incremental observation is better/worse for performance than bulk insertion with consolidated data keys.

To Sweep or Not To Sweep

I've spent the week porting the X-Plane scenery generation code back to CGAL.  I will describe the history of this code and its constant battle with robustness in another post, but for now I will just mention that credit goes (besides obviously to the authors of CGAL, who are geniuses of the Nth degree) to Andrew McGregor for a lot of the ideas on how to use CGAL with the scenery generation code - the final design is really just a refinement of his work.

The X-Plane scenery generation code (which we just call "the render farm") is the code that turns hundreds of GB of GIS data into scenery.  At its heart is a topologically integrated planar map of all vector features in a given tile - every road, river, coastline, change of terrain type, etc. is baked into one map.  A typical map might have half a million edges, covering a 1 x 1 degree area.

For X-Plane 8.20 and 9.0 we shipped using a planar map I wrote myself.  While the planar map didn't work very well, it had a few special cases that were highly optimized.  In particular, given two sub-regions of two maps, with an identical ring of half-edges identifying them (identical meaning identical coordinates) you could swap the interiors of the two regions in time proportional to only the number of elements inside the ring.

It turns out that this is a really useful trick for a few things:
  • If you have a huge map you want to tile, you simply induce the cut lines, and swap each single tile with an empty map.
  • If you want to "overlay" a polygon, or even a sub-map (destroying what is below) you simply induce a border ring in both polygons and swap.
  • You can crop a map's interior or exterior by inducing a ring and swapping with a blank map.
This was an operation where the devil is in the details - because components are swapped "in-place" the swap doesn't have to copy, allocate, or deallocate the map, so it turns out to be fast in practice.  In terms of time complexity, it's very fast for very small operations on a huge map, which turns out to be an important case for us.

Now CGAL's arrangement_2 gives me some power tools* that are very useful - in particular, sweep-line algorithms.  The link shows a good explanation.  Basically the power of sweep-line is to process a set of line segments in a lot less than O(N^2) time.

But there's a slight problem with sweep-line - it has to iterate over the entire set of edges in a map, including the ones already in the map.  That means that inserting a small number of edges into a complex map is surprisingly slow.

So in some cases I now have two implementations of various map operations:
  • Sweep-based implementations, which are good when we're inserting a lot of data at once.
  • Incremental implementations - theoretically much worse, but acceptable to keep the size of our data set down.
So far, incremental operations are proving to be a big win when "burning airports" - that is, merging or overlaying very small airport boundary areas into a full map.

* Of course, the real advantage of CGAL is that it is robust and doesn't randomly blow up during operations!

Wednesday, September 10, 2008

So Where Is That Fast Path?

Amongst the many rants I've read about the new OpenGL 3.0 spec is the claim that the spec needs to be cleaned out and rebuilt so that people can "find the fast path".

What application developers are getting at with this is that OpenGL is a rich API, and not all cards do everything at top speed.  There is usually one fastest way to talk to OpenGL to get maximum throughput.

The problem is: this is ludicrous.  Case in point, the GeForce 8800 - in my Mac Pro, running OS X 10.5.4 and Ubuntu 8.  So what is the fast path?

If I draw my terrain using stream VBOs for geometry and indices that are not in a VBO, I get 105 fps on Linux.  If I then put the indices into a stream VBO, I get 135 fps.  The fast path!

Well, not quite.  The index-without-VBO case runs at 73 fps on OS X, but once those indices go into a VBO, I crash down to 25 fps.  Wrong fast path.

Simply put, you can't spec the fast path, so the spec doesn't matter.  You find the fast path by writing a lot of code and trying it on a lot of hardware.  I can't see there ever being another way, given how many different cards and drivers are out there.

Monday, September 08, 2008

Geometry Shader Performance on the 8800

So this is what I learned today: when it comes to geometry shaders and the 8800, triangle strips matter. Now after you read the details, this will seem so obvious that you can only conclude that I am a complete dufus (something I will not necessarily dispute). But the 8800 (like most modern cards) is so bloody fast that triangle strips are actually not a win in almost all other configurations.

The test: a mesh of 1000 x 1000 quads (each in turn is two triangles), being rotated. Using a single static vertex buffer with static indexes, this runs at around 50-55 fps. Each vertex has 8 components (XYZ, normal, texture ST).

Now some numbers:
  • The baseline is around 54 fps.
  • Cutting the geometry to a 500x500 mesh brings us to around 204 fps, which is what we expect for a vertex-bound operation. The pixel shading has been kept intentionally simple to achieve this result.
  • Using a geometry shader which simply passes through the geometry has no effect on fps.
  • Cutting the mesh to 500x500 and using a geometry shader that splits one triangle into four by emitting 12 vertices and ending 4 primitives (e.g. tris) runs at a creeping 25 fps.
  • Cutting the mesh to 500x500 and using a geometry shader that splits one triangle into four by emitting eight vertices and 2 primitives (e.g. strips) runs at 68 fps.
  • When using this strip-based geometry shader, sorting the mesh indices by strip format (e.g. 0 1 2 2 1 3 2 3 4 4 3 5) improves throughput to 73 fps or so. When not using the geometry shader, this strip sorting has no impact.
Let's tease that mess apart and see what it means. Basically my goal was to test the performance of "dynamically created" geometry (e.g. creating more vertices from less using a geometry shader) vs. "mesh updating" (e.g. periodically re-tessellating the mesh and saving the results to new VBOs). The latter technique's best performance is simulated by the 1000x1000 VBO in VRAM; the former by the geometry shader.

As you can see, geometry shaders can outperform straight VBO drawing, but only if they are set up carefully. In particular, you can't have multiple-separate-triangle primitives in a geometry shader output, so if we want to draw distinct triangles, we have to end a lot of primitives. There is also no vertex indexing out the back of a geometry shader, so strips are a win.

(Contrast this to drawing out of a VBO - with indexing and multiple triangles per call, and a huge cost to restarting primitives, GL_TRIANGLES indexed is usually faster than strips.)

What's surprising here is not that strips are faster in the geometry shader, but that they are so much faster! With strips we've cut down the geometry data by about 30% (from 12 vertices to 8), but we get an almost 3x improvement in throughput. My theory is that emitting fewer primitives is what wins; we've cut down geometry and cut the number of primitives in half.

The moral of the story is: it pays to find a way to strip-ify the output of geometry shaders.

Wednesday, August 13, 2008

Fixing Shadow Maps

In the old days of stencil shadow volumes, the code to draw the scene might look something like this:
1. For each light source
2. Clear the stencil buffer
3. For each shadow caster
4. Draw its shadow volume
5. Draw the scene
The big problem here is the "shadow volumes" - that's geometry that may be as complex as the scene, but isn't the scene, and changes as the sun moves. We can get clever and try to write a geometry shader, but in the old days you had to build your shadow volumes yourself. Ouch.

Another problem with stenciled shadow volumes is that "fake" geometry shaped by transparency is ignored.

Shadow Maps

Shadow maps bake distance calculations into a texture, from the light source's point of view. But we have two ways to code this, both pretty painful:
1. For each light source
2. Draw the scene (depth only) from the light's perspective
3. For the scene
4. Draw the scene with a shader that knows about the shadow maps
Of course while that looks nice and fast, that shader features one shadow calculation per light source.

An alternative is to build the scene in passes:
1. For each light source
2. Draw the scene (depth only) from the light's perspective
3. Draw the scene with a shader to accumulate that light's contribution
The first option uses a huge shader and a lot of VRAM; the second one hits the scene graph a lot. The second case might be a win if we can use our scene graph culling code to draw only the part of the scene that could be affected by the light (assuming the culling works that well).

It should be noted that for one directional light (e.g. the sun) the shadow map case is great: we need one texture, one extra pass, and we automatically get shadows on everything. Transparent geometry is not shadowed, and we don't have to ask "how do I make a stencil volume" for everything that might show up in the scene graph. Render time for the shadow pass can be very fast because we can skip texturing and color write-out.

The Problem With Shadow Maps

The problem with shadow maps is shadow resolution. You lose resolution in 3 dimensions: the width/height of the shadow map (which shows up as big blocky pixels in the shadow) and "depth" - if the precision of depth values isn't good enough, "close calls" between the occluder and occludee are not resolved correctly due to rounding errors.

Depth precision is less of a factor now that modern hardware supports formats like R32F (a single 32-bit floating point channel) but horizontal resolution is limited by texture size; even a 2048x2048 texture doesn't provide much detail over more than about 500 meters of "world". For a flight simulator, that's not enough.

To make matters worse, the shadow texture's position is based on the light source, not the viewer's camera. This means that often the most detailed part of the shadow map is somewhere useless (like far away), while the ugly crude part is right in front of the user. There are a number of algorithms, like trapezoidal shadow maps (TSM) that try to reshape the shadow map to use the pixels better, but if your scene graph is huge, this isn't an improvement of the right magnitude.

Splitting the Lights

One way to solve the resolution problem is to break the light source into multiple "lights", each shadowing a different part of the scene. (Think of a directional light as really bounded by a cube, because this is the area the shadow map casts a shadow for. We are making many small cubes that cover the volume of the original light cube, which had to be big enough to cover the whole scene.)

This approach gives us a potentially arbitrary improvement in resolution, but it puts us in the "multiple light" scenario - two passes over (at least part of) the scene graph per light. If you have efficient culling there might be a win, but shadows are cast over a distance, so a small light volume can end up shadowing a large area at some light-source angles.

Stenciled Shadow Maps With GBuffers

I'm not sure if anyone has done this before - for all I know it is a known production technique and I just haven't found the paper. The technique I describe here uses a G-Buffer to save on iterations over the scene graph, and the stencil buffer to save on shadow map memory. If you already have a G-Buffer for deferred shading, this works well.

What Is a G-Buffer

A G-Buffer is a series of frame buffers which encode everything you could want to know about a given pixel. Typically for a given pixel X, at a minimum you would know the XYZ position in eye space, the normal vector, and the unlit ("raw") color of the source point on the source geometry that filled pixel X.

In other words, a G-Buffer is sort of a spatial index, giving you constant time access to your scene graph, given screen-coordinate X & Y as an input. With deferred shading, you can now create light effects - for a given screen-space pixel, you know what a given light would have done to that pixel, because you still have its normal, position, original color, etc.
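For concreteness, here's a rough sketch of a three-target G-Buffer (albedo, eye-space normal, eye-space position).  The formats and the use of core-style FBO entry points (rather than the EXT names of the era) are assumptions:

    #include <GL/glew.h>    // or whatever GL header/loader you use

    GLuint make_gbuffer(int w, int h, GLuint tex[3], GLuint* depth_rb)
    {
        GLuint fbo;
        glGenFramebuffers(1, &fbo);
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);

        const GLenum fmt[3] = { GL_RGBA8, GL_RGBA16F, GL_RGBA32F };   // albedo, normal, position
        glGenTextures(3, tex);
        for (int i = 0; i < 3; ++i) {
            glBindTexture(GL_TEXTURE_2D, tex[i]);
            glTexImage2D(GL_TEXTURE_2D, 0, fmt[i], w, h, 0, GL_RGBA, GL_FLOAT, NULL);
            glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
            glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
            glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0 + i,
                                   GL_TEXTURE_2D, tex[i], 0);
        }

        glGenRenderbuffers(1, depth_rb);
        glBindRenderbuffer(GL_RENDERBUFFER, *depth_rb);
        glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT24, w, h);
        glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                                  GL_RENDERBUFFER, *depth_rb);

        const GLenum bufs[3] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1, GL_COLOR_ATTACHMENT2 };
        glDrawBuffers(3, bufs);     // the scene pass writes all three targets at once
        return fbo;
    }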

G-Buffering is a win for lighting because we can radically reduce the number of pixels for which per-pixel lighting is applied.
  • Because we will apply lights later in screen space (by reading the G-Buffer to get the normals, positions, etc. of the pixels involved) there is never any overdraw. If your scene was created by drawing the screen 4x over, the G-Buffer contains the final result, only the pixels that showed up. So if you have 4x overdraw, G-Buffering represents a 4x reduction in lighting calculations.
  • If a light has limited scope (e.g. it is directional or attenuated), you only have to apply the lighting calculation to the small set of pixels that it could affect. With traditional lights, every pixel in the scene must consider every light (or you must turn the lights off and on very carefully). With G-Buffering, the "second-pass" of accumulating lighting can be done for small screen areas by drawing a trivial bounding volume for the light. This means that small lights are much, much cheaper in terms of pixel fill rate.
Be warned that G-Buffering also has three weaknesses that you have to eat just to get started:
  • To draw the scene itself, you have to fill a huge amount of data - perhaps 12 to 16 floating point values per pixel. That's a much higher fill rate to the framebuffer than you'd have with conventional texturing (or even HDR). So you'll pay a price just to have a G-Buffer. You only win if you save enough lighting calculations later to justify the G-Buffer now.
  • FSAA is not an option; the G-Buffer needs exact values from precisely one scene graph element for each pixel. So you may need to render at a higher res (ouch) or apply post-processing to smooth out geometry. (Post processing is recommended in GPU Gems 2.)
  • Alpha-blending works, um, badly. Basically you only get the final values of the front-most pixel. So you will have to compromise and accept a hack...for example, the front-most pixel's lighting calculations applies to the blend of the texture colors of all contributing pixels. If this hack ruins your content, your only alternative is to fully render the scene before blending, then do a second pass - at that point the G-Buffer is probably not a win.
G-Buffering and Shadow Maps

G-Buffering works well with shadow maps; you can "project" a shadow map onto a scene using only the contents of the G-Buffer, not the original scene itself. (To project the shadow you really only need object position, which is in the G-Buffer.)

But what about shadow map resolution? We can use the stencil buffer to accumulate the shape of the shadow cast by each shadowing element in the scene graph, and then bake the light. The full rendering loop would look like this:
1. Draw the entire scene to the G-Buffer
2. For each light source
3. Clear the stencil buffer
4. For each shadow-casting object (for this source)
5. Draw shadow map for this object
6. Project shadow map into stencil buffer using G-Buffer
7. Accumulate contribution of light, using stencil buffer
Note that for the sun-light case, our time complexity is relatively similar to the original shadow-map algorithm: step 5 at the worst case involves drawing a shadow map for the entire scene. (I'm waving my hands - we'll be filling a lot more pixels, because we've got a full-sized shadow map for individual elements. It's the equivalent of being allowed to have a truly enormous shadow map.)

But what's interesting about this is that the cost of the steps aren't necessarily as high as you'd think in terms of fill rate.
  • Step 5 (drawing the shadow map for one object) can use as many or as few pixels as we want -- we can choose our shadow map size based on the position of the object in eye space! That is, we can finally spend our shadow fill rate budget on things that are close to us. Since each object gets its own shadow map, our max resolution can be relatively high.
  • Step 6 requires us to draw some kind of "shadow casting volume" onto the screen - that is, to cover the pixels that could be shadowed by a given occluder. This has the potential to be a lot less than the entire screen, if we can draw relatively sane volumes. (But the only penalty for a cruder, larger volume is a few more shader ops.)
  • Step 7 requires us to burn in the light, which (as with all G-Buffering) only requires touching the pixels that the light could illuminate; so we get a win for lights with limited range or tight focuses.
This algorithm has some nice scalability properties too:
  • Because we know where in the final scene's eye space we are working, we can tune quality on a per-shadow basis to spend our computation budget where we need it. (E.g. turn off PCF for far-away shadows, use smaller maps, don't shadow at all.)
  • We can even make content-level decisions about shadows (e.g. shadow only buildings).
  • The algorithm trades quality for time. Traditional shadow maps are limited by maximum texture size, which is a painful way to scale. This lets us simply trade a faster CPU/GPU combo for better results, with no maximum limit.
Future Parallelization

If we ever get GPUs that can dispatch multiple command streams, the inner loop (4,5,6) could be parallelized. Essentially you'd have a few shadow map textures, and you'd pipeline step 5 on multiple cores, feeding the results to step 6. So a few GPUs would be prepping shadow maps, while one would be baking them into the stencil map. This would let us scale up the number of shadow-casting elements.

Monday, August 11, 2008

Perspective Correct Texturing in OpenGL

99.9% of the time, you don't have to worry about perspective correct texturing in OpenGL. If you draw a textured rectangle, it will look correct from any angle. But there are a few cases where you need to manage perspective yourself.

I blogged here on what perspective vs. non-perspective texturing looks like when a rectangle is deformed into another shape. This blog doesn't mention a minor detail: OpenGL will never give you that middle image - instead what you get is something like this. (What you see is the two triangles that OpenGL decomposes the quad into.)

Creating Perspective with the Q Coordinate

If you can easily identify the amount of "squishing" you have applied to one end of a quad, you can use the "Q" coordinate to produce perspective correct textures. We do this in X-Plane: the snow effect is a cylinder that wraps around the plane. In order to assure that the entire screen is covered, we close up one end to make a cone in some cases. When we do this, the quads that make the sides of the cylinders become trapezoids, then triangles. By using the Q coordinate technique, we keep our texture perspective correct.

There are two caveats of this technique:
  1. You can't really make the texture go to infinite distance (that would be a Q coordinate of zero). If you do, all texture coordinates are zero and some of your texture information is lost. To hack around this, we clamp to a very small minimum Q. Don't tell anyone!
  2. This technique will produce perspective changes in both directions of the texture! This is the "correct" perspective thing to do, but may be surprising. In particular, as we make the far end of our cylinder into a cone, because visually this is the equivalent of making it stretch out to much further away, the texture is stretched along the cylinder, making snow flakes that are close to us very tall and ones that are far away very short.
Why does this technique work? Basically, texture coordinates are 4-component vectors: the S & T coordinates you are used to are divided by the Q coordinate (the 4th component) when the texture is sampled. A smaller Q makes the division leave S & T effectively larger, which means more texture in a small area, which looks "farther away".

But the real magic is in how interpolation happens. When you specify a per-vertex variable that is "varying", it is interpolated across the polygon. In our case, the Q coordinate is interpolated, creating a pattern of shrinking Q's that stretches the texture dynamically. The Y=1/X shape of this curve creates "perspective".
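For the simple case where one edge of a quad has been squished by a known factor, the Q trick boils down to scaling S, T and Q together at the squished vertices. Here's a minimal fixed-function sketch; the vertex positions and the squish factor k are made-up illustration values, not X-Plane's actual snow geometry.

#include <GL/gl.h>

// Draw a trapezoid whose top edge is k times as wide as its bottom edge,
// with perspective-correct texturing.  S, T and Q are all scaled by k at the
// squished vertices; GL interpolates all four components across the triangles
// and divides by Q per fragment, recreating the 1/x "perspective" curve.
void draw_trapezoid_with_q(float k)     // e.g. k = 0.25f; clamp well above 0
{
    glBegin(GL_QUADS);
    // Wide bottom edge: Q = 1, ordinary texture coordinates.
    glTexCoord4f(0.0f, 0.0f, 0.0f, 1.0f);      glVertex2f(-1.0f, -1.0f);
    glTexCoord4f(1.0f, 0.0f, 0.0f, 1.0f);      glVertex2f( 1.0f, -1.0f);
    // Narrow top edge: multiply S, T and Q by k.
    glTexCoord4f(1.0f * k, 1.0f * k, 0.0f, k); glVertex2f( k, 1.0f);
    glTexCoord4f(0.0f * k, 1.0f * k, 0.0f, k); glVertex2f(-k, 1.0f);
    glEnd();
}

Clamping k to some small positive minimum, per caveat 1, keeps the squished edge from collapsing to Q = 0.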

Creating Perspective Via Matrices

This snippet of code is the nuclear weapon of perspective. You simply give it a trapezoid (four corners) in 2-d and it calculates a matrix that applies the texture in a manner that looks like the appropriate perspective. In other words, it's effectively calculating the vanishing points.

We use this in X-Plane for the skewed instruments - when the user drags the instrument corners, we simply calculate a matrix that creates the correct perspective.

The snippet of code contains comments with the derivation. One interesting property: with 8 free variables in a 2-d perspective matrix and 8 constraints (four 2-d corners), there exists exactly one perspective matrix for any given trapezoid.

One warning if you decide to use this with OpenGL: make sure your computed transform matrix doesn't produce negative W coordinates! (To do this, simply apply the resulting matrix to one of your input corners and check the resulting W. If it is negative, negate the entire matrix.)
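A sketch of that check, assuming the trapezoid matrix is a 3x3 homography stored row-major - the layout is my assumption for illustration, not something taken from the linked snippet:

#include <cstddef>

// Apply the bottom row of the matrix to one input corner (x, y, 1) to get its
// homogeneous W; if W is negative, negate every element of the matrix.  The
// perspective divide would cancel the negation mathematically, but a negative
// W upsets the pipeline long before the divide happens.
void fix_matrix_sign(double m[9], double corner_x, double corner_y)
{
    const double w = m[6] * corner_x + m[7] * corner_y + m[8];
    if (w < 0.0)
        for (std::size_t i = 0; i < 9; ++i)
            m[i] = -m[i];
}

The same idea works if you have already expanded the matrix to the 4x4 form OpenGL wants; just use whichever row produces W.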

I use LAPACK to solve the matrix in my version of this code, and LAPACK doesn't make a lot of guarantees for the "constant" term (a constant factor applied to all matrix values). If that constant is negative, you will get output coordinates where X, Y, Z and W are all negative - the perspective division would cancel this negation, but at least on the machines I've used, the negative "W" causes anarchy long before we get that far.

The reason you can use a projection matrix anywhere in OpenGL is that projection fundamentally relies on storing information in the "W" coordinate (the 4th coordinate) to be divided out later - the "W" just gets carried along for the ride and then produces the correct perspective effect.

Thursday, July 10, 2008

Sound and the 24" iMac

I finally found info on how to make it happen here.

Here's the short version:
  • I used Ubuntu 7.10, which has ALSA 1.0.14, which is a little different.
  • Make sure
    options snd-hda-intel model=bbp3
    is somewhere in your modprobe.d files, like in alsa-base.
  • Turn on a lot of goo in the mixer.
  • Surprisingly the built-in Mic is the digital input!
I now know that XSquawkBox Linux sound does work (but only if X-Plane itself is run with --no_sound to allow OSS to open).

Tuesday, July 01, 2008

I Love CGAL

Because CGAL does many magic geometry operations for me that (having tried to code them myself) are a PITA to implement.

But every now and then you end up with a link error like this:
"CCBToPolygon(CGAL::I_Filtered_const_iterator > > > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > > > >, GIS_halfedge_data>, CGAL::Arr_extended_face >, std::allocator > > > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > > > >, GIS_halfedge_data>, CGAL::Arr_extended_face > > >, CGAL::Arrangement_2 > >, std::vector > > >, std::allocator > > > > >, CGAL::Arr_segment_traits_2 > > > >, CGAL::Arr_extended_dcel > >, std::vector > > >, std::allocator > > > > >, CGAL::Arr_segment_traits_2 > > > >, GIS_vertex_data, GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > > > >, CGAL::Arr_halfedge_base > > > >, CGAL::Gps_face_base> >::_Is_valid_halfedge, CGAL::CGALi::In_place_list_iterator > > > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > > > >, GIS_halfedge_data>, CGAL::Arr_extended_face >, std::allocator > > > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > > > >, GIS_halfedge_data>, CGAL::Arr_extended_face > > >, CGAL::Arrangement_2 > >, std::vector > > >, std::allocator > > > > >, CGAL::Arr_segment_traits_2 > > > >, CGAL::Arr_extended_dcel > >, std::vector > > >, std::allocator > > > > >, CGAL::Arr_segment_traits_2 > > > >, GIS_vertex_data, GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > > > >, CGAL::Arr_halfedge_base > > > >, CGAL::Gps_face_base> >::Halfedge, int, std::bidirectional_iterator_tag>, Polygon2&, std::vector >*, double (*)(CGAL::I_Filtered_iterator > > > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > > > >, GIS_halfedge_data>, CGAL::Arr_extended_face >, std::allocator > > > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > > > >, GIS_halfedge_data>, CGAL::Arr_extended_face > > >, CGAL::Arrangement_2 > >, std::vector > > >, std::allocator > > > > >, CGAL::Arr_segment_traits_2 > > > >, CGAL::Arr_extended_dcel > >, std::vector > > >, std::allocator > > > > >, CGAL::Arr_segment_traits_2 > > > >, GIS_vertex_data, GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > > > >, CGAL::Arr_halfedge_base > > > >, CGAL::Gps_face_base> >::_Is_valid_halfedge, CGAL::Arrangement_2 > >, std::vector > > >, std::allocator > > > > >, CGAL::Arr_segment_traits_2 > > > >, CGAL::Arr_extended_dcel > >, std::vector > > >, std::allocator > > > > >, CGAL::Arr_segment_traits_2 > > > >, GIS_vertex_data, GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > > > >, CGAL::Arr_halfedge_base > > > >, CGAL::Gps_face_base> >::Halfedge, int, std::bidirectional_iterator_tag>), Bbox2*)", referenced from:
FaceToComplexPolygon(CGAL::I_Filtered_const_iterator > > > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > > > >, GIS_halfedge_data>, CGAL::Arr_extended_face >, std::allocator > > > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > > > >, GIS_halfedge_data>, CGAL::Arr_extended_face > > >, CGAL::Arrangement_2 > >, std::vector > > >, std::allocator > > > > >, CGAL::Arr_segment_traits_2 > > > >, CGAL::Arr_extended_dcel > >, std::vector > > >, std::allocator > > > > >, CGAL::Arr_segment_traits_2 > > > >, GIS_vertex_data, GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > > > >, CGAL::Arr_halfedge_base > > > >, CGAL::Gps_face_base> >::_Is_valid_face, CGAL::CGALi::In_place_list_iterator > > > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > > > >, GIS_halfedge_data>, CGAL::Arr_extended_face >, std::allocator > > > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > > > >, GIS_halfedge_data>, CGAL::Arr_extended_face > > >, CGAL::Arrangement_2 > >, std::vector > > >, std::allocator > > > > >, CGAL::Arr_segment_traits_2 > > > >, CGAL::Arr_extended_dcel > >, std::vector > > >, std::allocator > > > > >, CGAL::Arr_segment_traits_2 > > > >, GIS_vertex_data, GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > > > >, CGAL::Arr_halfedge_base > > > >, CGAL::Gps_face_base> >::Face, int, std::bidirectional_iterator_tag>, std::vector >&, std::vector >, std::allocator > > >*, double (*)(CGAL::I_Filtered_iterator > > > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > > > >, GIS_halfedge_data>, CGAL::Arr_extended_face >, std::allocator > > > >, GIS_vertex_data>, CGAL::Arr_extended_halfedge > > > >, GIS_halfedge_data>, CGAL::Arr_extended_face > > >, CGAL::Arrangement_2 > >, std::vector > > >, std::allocator > > > > >, CGAL::Arr_segment_traits_2 > > > >, CGAL::Arr_extended_dcel > >, std::vector > > >, std::allocator > > > > >, CGAL::Arr_segment_traits_2 > > > >, GIS_vertex_data, GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > > > >, CGAL::Arr_halfedge_base > > > >, CGAL::Gps_face_base> >::_Is_valid_halfedge, CGAL::Arrangement_2 > >, std::vector > > >, std::allocator > > > > >, CGAL::Arr_segment_traits_2 > > > >, CGAL::Arr_extended_dcel > >, std::vector > > >, std::allocator > > > > >, CGAL::Arr_segment_traits_2 > > > >, GIS_vertex_data, GIS_halfedge_data, GIS_face_data, CGAL::Arr_vertex_base > > > >, CGAL::Arr_halfedge_base > > > >, CGAL::Gps_face_base> >::Halfedge, int, std::bidirectional_iterator_tag>), Bbox2*)in MapAlgs.o

Saturday, May 31, 2008

How To Drive CVS Totally Insane

Okay, I do have to admit: posting on why CVS sucks or how to make it slow is the lamest thing ever.  I mean, duh!  Everyone knows that CVS is 1970's technology, 3 parts dope to 1 part disco, and there are better alternatives.

But...I think the performance situation I found is interesting because of when it happens.  (In particular, this performance problem has been with us for months and I only noticed now.)

The airport database is a 40 MB, 800,000-line text file.  We checked it into CVS.

Hey, stop laughing...it seemed like a good idea at the time!

What's interesting is that CVS can (or at least could) check revisions into the main branch fast enough that we didn't realize we were getting into trouble, and check out the head revision at linear speed, limited only by net connection.

Poking into the repository, it turns out that this is because the head revision is stored in its entirety, with diff instructions to step backward, one revision at a time, toward the first revision.

This goes a ways toward explaining why CVS doesn't become incrementally slower at getting code as you use it: in truth it does, but the cost is borne by the older versions of the repository, not the newer ones!

Where we got into trouble was when I checked data into a branch.  The branch cannot be the head revision, so it is stored as a delta off the version it came from.  With a huge file, the cost of applying even one diff on the fly is quite large.

In our case, getting the head revision takes less than a minute; getting either the previous revision of this file or the branched latest version takes over 45 minutes!

I am not yet sure what we can do about this - the fact that historical check-outs will be slow is annoying, but clearly we don't use that feature very often.  (Fortunately this is a rarely-changing file.)

Sunday, May 25, 2008

Pitfalls of Performance-Tuning OpenGL

Performance-profiling OpenGL is more difficult than profiling a non-OpenGL application because:
  1. OpenGL uses a pipelined architecture; the slowest component will slow performance while other parts of the system go idle and
  2. You don't always have good insight into what's going on inside the OpenGL pipeline.
(On this second point, tools like NVidia's PerfHUD and GLExpert can tell you some of what is going on, but your regular adaptive sampling profiler won't help you much.)

Pitfall 1: Not Being Opportunistic

The key to opportunistic performance tuning is to look at your whole application - creating a specialized "test case" to isolate performance problems can be misleading. For example, in X-Plane our break-down of main-thread CPU use might be roughly this:
  • Flight model/physics: 10%.
  • 2-d Panel: 5%.
  • 3-d World Rendering: 85%.
This is telling us: the panel doesn't matter much - it's a drop in the bucket. Let's say I make the panel code twice as fast (a huge win). I get a 2.5% performance boost. Not worth much. But if I make the world rendering twice as fast I get a 73% performance boost.

The naive mistake is to stub out the physics and world rendering to "drill down" into panel. What the profiler is saying is: don't even bother with the panel, there are bigger fish to fry.
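The arithmetic behind those percentages is just Amdahl's law. A quick sanity check, using only the numbers from the breakdown above (nothing measured):

#include <cstdio>

// Overall speedup when a component taking 'fraction' of the frame time is made
// 'local_speedup' times faster (Amdahl's law).
static double overall_speedup(double fraction, double local_speedup)
{
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup);
}

int main()
{
    // Panel is 5% of the frame; doubling it gives ~1.026x, roughly a 2.5% win.
    std::printf("panel 2x: %.1f%% faster\n", (overall_speedup(0.05, 2.0) - 1.0) * 100.0);
    // World rendering is 85%; doubling it gives ~1.74x, roughly 74% faster.
    std::printf("world 2x: %.1f%% faster\n", (overall_speedup(0.85, 2.0) - 1.0) * 100.0);
    return 0;
}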

Pitfall 2: Non-Realistic Usage and the Pipeline

The first pitfall is not OpenGL specific - any app can have that problem. But drill-down gets a lot weirder when we have a pipeline.

The key point here is: the OpenGL pipeline is as slow as the slowest stage - all other stages will go idle and wait. (Your bucket brigade is as slow as the slowest member.)

Therefore when you stub out a section of code to "focus" on another and OpenGL is in the equation, you do more than distort your optimization potential; you distort the actual problem at hand.

For example, imagine that the panel and scenery code together use up all of the GPU's command buffers (one phase of the pipeline), but pixel fill rate is not used up. When you comment out the scenery code, we use fewer command buffers, the panel runs faster, and all of a sudden the panel bottlenecks on fill rate.

We look at our profiler and go "huh - we're fill rate bound" and try to optimize fill rate in the panel through tricks like stenciling.

When we turn the scenery engine back on, our fill rate use is even lower, but since we are now bottlenecked on command buffers, our fill rate optimization gets us nothing. We were misled because we optimized the wrong bottleneck. We saw the wrong bottleneck because the bottleneck changed when we removed part of the real workload.

One non-obvious case of this is frame rate itself; a high frame rate can cause the GPU to bottleneck on memory bandwidth and fill rate, as it spends time simply "flipping" the screen over and over again. It's as if there's an implicit full-screen quad drawn per frame; that's a lot of fill for very little vertex processing - as the ratio of work per frame to number of frames changes, that "hidden quad" starts to matter.
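A back-of-the-envelope illustration of that hidden cost - the screen size, bytes per pixel, and frame rates here are entirely made-up example numbers, not measurements:

#include <cstdio>

int main()
{
    const double pixels        = 1920.0 * 1200.0;  // hypothetical screen
    const double bytes_per_pix = 4.0;              // one RGBA8 write per pixel per frame
    const int    fps[]         = { 30, 300 };

    for (int i = 0; i < 2; ++i)
    {
        const double gb_per_sec = pixels * bytes_per_pix * fps[i] / 1.0e9;
        std::printf("%3d fps: ~%.1f GB/s of fill just to repaint the whole screen\n",
                    fps[i], gb_per_sec);
    }
    return 0;
}

At 300 fps the "hidden quad" alone is eating a couple of gigabytes per second of bandwidth, which is one more reason profiling at unrealistically high frame rates distorts the picture.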

So as a general rule, load up your program and profile graphics in a real-world scenario, not at 300 fps with too little work (or 3 fps with too much work).

Pitfall 3: Revealing New Bottlenecks

There's another way you can get in trouble profiling OpenGL: once you make the application faster, OpenGL load shifts again and new bottlenecks are revealed.

As an example, imagine that we are bottlenecked on the CPU (typical), but pixel fill rate is at 90% capacity and other GPU resources are relatively idle. We go to improve CPU performance (because it's the bottleneck), but we optimize too far and end up bottlenecked completely on fill rate; we see only a fraction of the benefit of our CPU work.

Now this isn't the worst thing; there will always be "one worst problem" and you can then optimize fill rate. But it's good to know how much benefit you'll get for an optimization; some optimizations are time consuming or have a real cost in code complexity, so knowing in advance that you won't get a big win is useful.

Therefore there is a case of stubbing that I do recommend: stubbing old code to emulate the performance profile of new code. This can be difficult -- for example, if you're optimizing CPU code that emits geometry, you can't just stub the code out or the geometry disappears. But when you can find this case, you can get a real measurement of what kind of performance win is possible.