Previously I suggested using a Delaunay triangulation as a spatial index to find "squatters" when simplifying an arrangement. If we want to simplify a triangulation itself, then the triangulation is the spatial index.
Consider a constrained Delaunay triangulation, where our arrangement edges have been replaced with triangulation constraints, and there are no free vertices (nor are there vertices in the triangulation that don't have at least one incident constraint).
We can now use an idea from the previously referenced paper: given a pair of constraints forming a curve pqr that we want to simplify into pr, if there exist any vertices that might be on an edge or in the interior of triangle pqr, then they must be adjacent to q in the triangulation (and between pq and pr on the acute side of pqr).
This means that we can simply circulate vertex q to search for squatters. The triangulation is the index.
Why does this work? Well, consider the case where there exists a vertex X inside triangle PQR that is not adjacent to Q. You can't have free vertices in a triangulation; the act of triangulating out X is going to create at least one link between Q and X; the only way that this will not happen is if there is already some other point inside PQR that is closer to Q than X (and in that case, we fail for that other point).
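Here is a minimal sketch of that circulation test, assuming a CGAL-style constrained Delaunay triangulation; the helper function and the containment test are illustrative, not the code I actually ship (in particular, the collinear case would be handled separately):

// Sketch: look for "squatters" blocking the simplification p-q-r -> p-r by
// circulating the neighbors of q in a constrained Delaunay triangulation.
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <CGAL/Constrained_Delaunay_triangulation_2.h>

typedef CGAL::Exact_predicates_inexact_constructions_kernel K;
typedef CGAL::Constrained_Delaunay_triangulation_2<K>       CDT;

bool has_squatter(const CDT& cdt, CDT::Vertex_handle p,
                  CDT::Vertex_handle q, CDT::Vertex_handle r)
{
    K::Triangle_2 tri(p->point(), q->point(), r->point());
    if (tri.is_degenerate()) return false;            // collinear pqr handled elsewhere
    CDT::Vertex_circulator v = cdt.incident_vertices(q), stop(v);
    if (v == 0) return false;
    do {
        if (cdt.is_infinite(v) || v == p || v == r) continue;
        // Any neighbor of q that is on or inside triangle pqr blocks the merge.
        if (tri.has_on_bounded_side(v->point()) || tri.has_on_boundary(v->point()))
            return true;
    } while (++v != stop);
    return false;
}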
The triangulation also has the nice property that when we remove a vertex X, we can reconsider its adjacent vertices to see if X was a squatter of those other vertices. This works because if X is a squatter of Q and there are no other squatters (thus removing X "unlocks" Q) then X and Q must be connected.
Implementation Notes
In my case there is one other implementation consideration besides the usual requirements: there is a many-to-many link between vertices in my original arrangement and vertices in my triangulation. Some triangulation vertices will not have original nodes because they represent subdivision of the triangulation to improve elevation data, and some original arrangement vertices will not be in the triangulation due to simplification of the triangulation's constraints.
The problem is: how do we work backward from a triangulation triangle to an original arrangement face? Given a triangle with a constraint on one side, we need to figure out what arrangement halfedge(s) it links to.
In order to keep this "back-link" unambiguous, we cannot remove all of the degree 2 vertices from a poly-line of original edge segments. We need to leave at least one "poly-line interior" vertex in place to disambiguate two paths between vertices in the arrangement. (This case happens a lot when we have closed loops.)
In practice we could never remove the poly-line interior vertices from all paths anyway (they would collapse to zero paths), but we also don't want to remove them from any poly-line, because that makes resolving the original arrangement face more difficult.
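To make the ambiguity concrete, here is a hypothetical way the back-link could be keyed; it is not my actual data structure, but it shows why each poly-line needs an interior vertex of its own:

// Hypothetical back-link from a triangulation constraint segment to the
// arrangement halfedge it came from, keyed on the segment's two vertices.
// If both halves of a closed loop were simplified down to a single segment,
// they would key on the same vertex pair and the lookup would be ambiguous;
// keeping at least one interior vertex per poly-line keeps the keys distinct.
#include <map>
#include <utility>

struct TriVertex;       // triangulation vertex (placeholder)
struct ArrHalfedge;     // original arrangement halfedge (placeholder)

typedef std::pair<const TriVertex*, const TriVertex*> ConstraintKey;

ConstraintKey make_key(const TriVertex* a, const TriVertex* b)
{
    return (a < b) ? std::make_pair(a, b) : std::make_pair(b, a);  // order-independent
}

typedef std::map<ConstraintKey, ArrHalfedge*> BackLink;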
Thursday, May 26, 2011
Mesh Simplification Part II - Arrangement Simplification
In my previous post, I suggested that we can iteratively simplify an arrangement if we can test a node's degree, the pre-existence of the simplifying edge we want to replace it with, and confirm that there are no "squatting" vertices inside the triangle formed by the two old edges and the one new one.
To simplify an arrangement, therefore, what we really need is a good spatial index to make searching for squatters fast.
Previously I had used a quadtree-like structure, but I seem to be getting better results using a Delaunay triangulation. (This idea is based on the CGAL point_set_2 class).
- We insert every vertex of our arrangement into a Delaunay triangulation.
- When we want to check for squatters, we find the minimum circle enclosing the triangle pqr (where pqr is the curve pair we want to simplify to pr) and search the triangulation for nodes inside the circle.
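A minimal sketch of that search, using CGAL's Point_set_2 for illustration (in practice I use my own point set, and here the circumscribed circle of pqr stands in as a conservative superset of the minimum enclosing circle):

// Sketch: find candidate squatters for the simplification p-q-r -> p-r by
// range-searching a Delaunay-backed point set with a circle enclosing pqr.
#include <vector>
#include <CGAL/Exact_predicates_inexact_constructions_kernel.h>
#include <CGAL/Point_set_2.h>

typedef CGAL::Exact_predicates_inexact_constructions_kernel K;
typedef CGAL::Point_set_2<K>                                PointSet;

std::vector<PointSet::Vertex_handle>
find_candidates(PointSet& ps, const K::Point_2& p, const K::Point_2& q, const K::Point_2& r)
{
    std::vector<PointSet::Vertex_handle> out;
    K::Circle_2 circle(p, q, r);                     // circumcircle; p, q, r must not be collinear
    ps.range_search(circle, std::back_inserter(out));
    // The caller still rejects p, q, r themselves and any hit that is inside
    // the circle but outside triangle pqr.
    return out;
}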
Implementation Details
For my implementation using arrangements, there are a few quirks:
- I use my own point set; CGAL's point set uses a stack-based depth-first search that tends to flood the stack for large data sets.
- I do not re-queue previously "blocked" points as squatters are removed. This would be a nice feature to add at some point (but is not easily added with a mere index).
- I abuse CGAL's "merge_edge" routine to do the simplification. Edge merge was meant for collinear curves; in my case I pre-ensure that it is a safe operation. The advantage of using merge_edge vs. actually inserting the new edges and removing the old ones is speed and stability: no faces are created or removed, thus face data stays stable, and no geometry tests are needed to determine what holes go in what face, etc.
- Because I am edge-merging, I can't merge two edges that have opposite x-monotone "direction" - thus some details won't get simplified. This is a limitation of CGAL's arrangement interface.
Since the method recycles two of the four half-edges in the merge, if the first half of the curve points in the opposite direction of the merged curve, the merge is changing the half-edge's direction.
Could this case happen if the merged edge had the same path as the original two edges? No. In order for the direction to change, the two underlying curves cannot be summed to a single curve that is still x-monotone, which is a requirement for CGAL's arrangement.
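For reference, the merge itself is roughly the following - a sketch against CGAL's segment-traits arrangement, not my production code; the caller is assumed to have already verified the degree, pre-existence, squatter and x-monotone-direction conditions described here:

// Sketch: replace the curve pair p-q, q-r with the single segment p-r using
// Arrangement_2::merge_edge.  merge_edge assumes the operation is safe; no
// faces are created or destroyed, so face data stays stable.
#include <CGAL/Exact_predicates_exact_constructions_kernel.h>
#include <CGAL/Arr_segment_traits_2.h>
#include <CGAL/Arrangement_2.h>

typedef CGAL::Exact_predicates_exact_constructions_kernel  K;
typedef CGAL::Arr_segment_traits_2<K>                       Traits;
typedef CGAL::Arrangement_2<Traits>                         Arr;

Arr::Halfedge_handle simplify_pair(Arr& arr, Arr::Halfedge_handle pq, Arr::Halfedge_handle qr)
{
    // pq ends at q, qr starts at q; the merged curve runs from p to r.
    Traits::X_monotone_curve_2 pr(pq->source()->point(), qr->target()->point());
    return arr.merge_edge(pq, qr, pr);
}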
Mesh Simplification Part I - It's All About Squatters
I've been working on map simplification for a few days now - it seems like I periodically have to revisit this problem. After working on simplification yet again, I realized that the problem statement is even simpler than I realized.
Given an arrangement (that is, a set of line segments and possibly free points such that line segments don't cross or end in each other's interiors) we can iteratively simplify the map by replacing adjacent pairs of line segments with a "short-cut" (e.g. replace line segments pq and qr with pr) given the following conditions:
- The degree of vertex q is 2 (e.g. only pq and qr emerge from q).
- Line segment pr is not already in the arrangement.
- If p, q, and r are not collinear, there are no points in the interior of triangle pqr (nor directly on segment pr between p and r). By definition there can't be any points on pq or qr.
If a "squatting" point does sit inside triangle pqr, then either:
- There is some kind of island geometry (or free vertex) and it will be on the wrong side of pqr after simplification, or
- The geometry "connects" to the world outside pr and pr will intersect at least one segment.
Given this, we can build an iterative algorithm for simplifying a mesh:
- Put every vertex that passes these tests into a queue, based on the error introduced by removing it.
- Remove the first vertex.
- Requeue neighboring vertices based on changed error metrics.
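In pseudo-C++ the loop looks something like this; it is a sketch only, and the vertex type plus the error, legality and removal helpers are placeholders for the arrangement-specific code described above:

// Sketch of the greedy simplification loop: a priority queue of removable
// vertices keyed by the error their removal would introduce.
#include <queue>
#include <vector>

struct Vertex;
double               error_of(Vertex* v);       // error introduced by removing v
bool                 can_remove(Vertex* v);     // degree 2, no pr edge, no squatters
void                 remove_vertex(Vertex* v);  // replace pq, qr with pr
std::vector<Vertex*> neighbors_of(Vertex* v);

struct QueueEntry {
    double  error;
    Vertex* v;
    bool operator<(const QueueEntry& o) const { return error > o.error; }  // min-heap
};

void simplify(std::vector<Vertex*>& verts, double max_error)
{
    std::priority_queue<QueueEntry> q;
    for (Vertex* v : verts)
        if (can_remove(v))
            q.push({ error_of(v), v });

    while (!q.empty() && q.top().error <= max_error) {
        Vertex* v = q.top().v; q.pop();
        if (!can_remove(v)) continue;            // entry may be stale - recheck
        std::vector<Vertex*> nbrs = neighbors_of(v);
        remove_vertex(v);
        for (Vertex* n : nbrs)                   // removal may unlock neighbors
            if (can_remove(n))
                q.push({ error_of(n), n });
    }
}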
The Zone Is Not What We Want
Previously I had coded similar logic via a zone visiting calculation - that is, finding every face, line and point that the edge pr would intersect. This had a few problems:
- Arrangement zone calculations are really expensive. Given a simple polygon with X sides, we may have to do as many as X zone calculations (if any vertex is eligible for removal) and the zone calculation iterates the polygon boundary. Thus we have an O(N^2) calculation, which is really painful for large polygons made of a large number of small sides. (Sadly, that is precisely what my data tends to be.)
- The zone calculation is wrong; even if we don't crash into anything while computing the zone, if the zone has holes that would end up on the inside of triangle pqr, then we still should not simplify. So we would have to iterate over holes as well as calculate the zone.
Tuesday, May 24, 2011
Instancing Limits
I posted some instancing numbers a while ago. As we keep bashing on things, we've found that the upper limit for instanced meshes on a modern Mac with an ATI card appears to be about 100k instanced batches.
There is definitely a trade-off between the number of "actual" batches (e.g. the number of actual draw calls into the driver) and the number of instanced meshes. So we're looking at how to best trade off larger clumps of meshes (fewer driver calls) with smaller ones (less extra drawing when most of the clump is off screen and could have been culled).
There is also a point at which it's not worth using instancing: if the number of objects in the instanced batch is really low, it's quicker to use immediate mode instancing and call draw multiple times. We're not precisely sure where that line is, but it's low - maybe 2 or 3 batches.
(Note that using client arrays to draw an instanced batch where the instancing data is in system memory appears to be a non-accelerated case on 10.6.x - if the instance data isn't in a VBO we see performance fall over and die.)
I Hate C++ Part 857: But I Already Had Coffee!
Sigh...
(ioMesh.is_edge(pts[pts.size()-2],pts[pts.size()-1]),h,vnum)
Clearly it's time to switch to espresso.
Wednesday, May 18, 2011
Performance Tuning Cars
I took a few hours to performance tune X-Plane 10's cars. I must admit, this really isn't what I am supposed to be doing, but I can't resist performance tuning, and the cars touch a number of different scenery subsystems at once.
Initial Tests
I ran a few initial tests to understand the performance problems:
- Cars on max, vis distance on max, other parts of the sim "dumbed down" to isolate car costs.
- True framerate measured, including < 19 fps.
- Looked at 10.5.8 and 10.6.7 to make sure analysis on 10.5.8 wasn't biased by driver performance (which is way better on 10.6.7).
- Looked at paused forward view, which isolates car drawing (no car AI when paused), and no-pause down view, which isolates car AI (no drawing when the world is culled).
- Sharked both configs, time profile, all thread states, focused on the main thread.
Initial Findings
The sim was somewhat limited by car AI (about 20 ms per frame), and heavily bound by car headlights (we were pushing 500,000+ headlights at about 7-8 fps). The actual 3-d physical cars were a non-issue: very few due to limited visibility distance, and they all hit the instancing path (which has a huge budget on DX11 hardware).
The major performance hits were:
- AI: time to edit the quad tree when cars are moved. Since there is already a cache on this, that means that what's left of this op must be really slow.
- The quad tree editing requires access to some per-object properties that aren't inlined.
- Drawing: the transform time for car headlights.
The sim transforms and builds deferred "spill" lights even in the forward renderer. This is hugely wasteful, as these lights are just thrown out, and getting rid of it nearly doubles draw performance. (There's another bit of dumb work - the car headlights are fully running during the day. I'll wait on that one; the problem is that X-Plane's lighting abstraction leaves "on/off during the day" totally to the GPU in v10, so the CPU can't eliminate a ton of work there. I'll leave it to later to decide how much to "expose" the implementation to cull out lights that are off during the day.)
Another "wasted work" note: the spill lights are still transformed in HDR mode, but when we actually need them, things are really bad - about 5 fps. CPU use is only 70%, which is to say, 500,000 deferred lights is a lot of lights, even for a deferred renderer. (It may also be that the 2 million vertices pushed down to make this happen is starting to use up bus bandwidth.) So we can consider a few more optimizations later:
- Provide a cutoff distance for spawning deferred lights from dynamic scenery elements. Since we've measured the distance on these elements anyway (to decide if we want the 3-d car) we could choose to strip out deferred lights.
- We may want to migrate the deferred lights into geometry shaders and/or instancing to cut down on-CPU transform time and bus bandwidth.
- We may want to "prep" streamed light VBOs on a worker thread.
Similarly, threading the build-out of VBOs is going to be dependent on driver capability. If the driver can't accept mapping a buffer on a worker thread and unmapping it on the main thread, then we're going to have problems keeping the worker thread independent without pre-allocating a lot of VBO memory.
Cutting down the LOD distance of deferred lights from 20 km to 5 km takes us from 2.1 million deferred light vertices at 5 fps to 128k vertices at 10 fps - roughly the 16x reduction you'd expect, since the lit area scales with the square of the distance. We can even go down to 2 km for 12 fps.
Inlining
Another thing that the Shark profiles showed was that the cars effectively generated some hot loops that had to call non-inlined accessor functions. I had a note to examine inlining at some point, but in this case a few specific accessors "popped". Inlining is a trade-off between keeping code clean and encapsulated and letting the compiler boil everything down.
In this case, I chose to use macros to control whether an inline code file is included from the header or the translation unit. Xcode still sees the inlines as a dependency, but we get faster compiles and better GDB behavior in debug mode, plus a clean public header.
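The pattern is roughly this (the file names, macro name and class are illustrative, not the actual X-Plane headers):

// car_props.h - the public header stays clean:
class CarProps {
public:
    float   heading() const;
    float   speed() const;
private:
    float   heading_, speed_;
};

#if WANT_INLINING
    #define CAR_INLINE inline
    #include "car_props.inl"        // release: bodies are visible to every caller
#endif

// car_props.inl - the accessor bodies, written once:
CAR_INLINE float CarProps::heading() const { return heading_; }
CAR_INLINE float CarProps::speed()   const { return speed_;   }

// car_props.cpp - debug: bodies compile out-of-line for fast builds and sane GDB:
#include "car_props.h"
#if !WANT_INLINING
    #define CAR_INLINE
    #include "car_props.inl"
#endif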
Inlining didn't change the performance of drawing car headlights (whose code path was mostly inline already) but it made a significant difference in the car AI (that is, the code that decides where the cars will drive to) - that code had to update the quadtree using non-inline accessors; with 44k cars navigating we went from 41 fps. (Also notable: the sim will run at 73 fps in the exact "AI" stress case but with cars off - that's 13 ms for basic drawing and the flight model and 11 ms for cars. There's still a lot to be had here.)
When the inlining dust settles, we have on-CPU headlight transform taking a lot of time at the low level, and reculling the scene graph when moving cars still showing up prominently.
More Scrubbing
At this point we have to look at the low level headlight transformer in x86 assembly. Wow, that's not pretty - it looks like someone took a Scrabble set, emptied it on the floor, and then hit it repeatedly with a hammer. We can pull some conditional logic out of the tight loop for a small win (about 5%), but even better: the sim is trying to "modulate" the phase of each headlight individually. That's a clever technique for authors to have available, but totally stupid for headlights, because headlights don't flash. Pull that out and we are getting 695k headlights at 14.5 fps. There's no substitute for simply not doing things.
The assembly is spending a surprising amount of its time in the matrix transform. It's a rare case where SSE can be a win - in this case, we can squeeze another 15% out of it. Note that before I spent any time on SSE, I did an L2 profile - this attributes the code's time to whoever is missing the L2 cache. The hot loop for the transform didn't show up at all, indicating that the time was not being spent waiting for memory. This surprised me a little bit, but if the memory subsystem is keeping up, trying to cram more instructions through the CPU can be a win.
Now you might think: why not just push the work onto another core, or better yet the GPU? The answer is that the particular problem doesn't play well with the more performant APIs. We do have half a million transforms to do, but:
- Since the transform is going straight into AGP memory, filling the buffers with multiple threads would require OpenGL sync code - we may get there some day, but that kind of code requires a lot of in-field validation to prove that the entire set of drivers we run on (on three operating systems) handles the case both correctly and without performance penalties.
- The vertex data is generated off an irregular structure, making it difficult to put on the GPU. (This is definitely possible now with the most programmable cards, but it wouldn't run on our entire set of required hardware.)
One last note on the scrubbing: it really demonstrated the limits of GCC's optimizer. In particular, if a call tree is passing param values that always branch one way, but the call site at which the constant is passed is not inlined with the actual if statement, I can get a win by "specializing" the functions myself. From what I can tell, GCC can't "see" that far through the function tree. This is with GCC 4.0 and no guided profiling, so it may be that there are better tools. I also haven't tried LLVM as a back-end yet; LLVM is supposedly quite clever with inference.
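Here's the kind of hand specialization I mean - an illustrative sketch, not the actual X-Plane internals:

// Before: 'wants_hdr' is constant at every call site, but the test lives deep
// in a non-inlined callee, so the compiler re-evaluates it for every light.
struct Light { float x, y, z, brightness; };
void transform_lights(Light* lights, int count, bool wants_hdr);

// After: hand-specialize so each call tree contains only the branch it needs.
template <bool WANTS_HDR>
void transform_lights_t(Light* lights, int count)
{
    for (int i = 0; i < count; ++i) {
        lights[i].brightness = 1.0f;          // ...shared per-light work...
        if (WANTS_HDR) {
            lights[i].brightness *= 4.0f;     // compile-time constant branch:
        }                                     // the non-HDR instantiation drops it
    }
}

// Call sites pick a specialization once, outside the hot loop:
//   transform_lights_t<true >(lights, n);    // HDR renderer
//   transform_lights_t<false>(lights, n);    // forward renderer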
Multi-Core = Free?
There's one last trick we can pull: we can move the per-frame car AIs off onto another thread that runs parallel to the flight model. On a multi-core machine this is "free" processing if the worker threads are not saturated. When looking "down" (so the AI is the only cost) this takes us from 65 to 80 fps. In a forward view with 20 km visibility, we are (after all of our work on CPU) possibly limited by bus bandwidth - CPU use is "only" 85%. This is important because in this configuration, removing the AI work won't show up as a fps change. If we reduce the car distance from 20 km to 10 km for drawing (we still pay for AI all the time) we still get a few fps back.
Follow-Up: Final SSE Numbers
I wrote the rest of this post about a week and a half ago. Today I cleaned up some of my SSE code, the result being that I can turn SSE on and off for the same matrix primitives.
Putting the sim into a "car-heavy" state causes 23% of frame-time to be spent on car headlights; using SSE improves overall frame-time by 6% (that is, we get a 6% net boost in fps). This implies that the actual car headlight code (currently the only SSE-capable code) becomes about 26% faster. Since the headlights are only partly math ops, the actual matrix transforms are significantly faster. (Note the baseline still uses non-SIMD single-float SSE as the math ABI.)
So what can we conclude:
- Using SSE for our matrix ops is a big win for actual time spent doing matrices.
- X-Plane doesn't have a lot of CPU time in this category.
Tuesday, May 17, 2011
SSE? It's the Memory, Stupid
One last SSE note: I went to apply SSE optimizations to mesh indexed matrix transforms. While applying some very simple SSE transforms improved throughput 15%, that gain went away when I went for a more complex SSE implementation that tried to avoid the cost of unaligned loads.
Surprising? Well, when I Sharked the more complete implementation it was clear that it was bound up on memory bandwidth. Using the CPU more efficiently doesn't help much if the CPU is starved for data.
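For context, the "very simple" flavor of SSE transform is along these lines - a sketch of a column-major 4x4 matrix applied to a stream of xyzw vectors with unaligned loads; the fancier variants that try to dodge _mm_loadu_ps are the ones that ran into the memory wall:

// Sketch: out = col0*x + col1*y + col2*z + col3*w for each 4-float vector.
#include <xmmintrin.h>

void transform_points(const float m[16], const float* in, float* out, int count)
{
    __m128 c0 = _mm_loadu_ps(m + 0);
    __m128 c1 = _mm_loadu_ps(m + 4);
    __m128 c2 = _mm_loadu_ps(m + 8);
    __m128 c3 = _mm_loadu_ps(m + 12);
    for (int i = 0; i < count; ++i, in += 4, out += 4) {
        __m128 x = _mm_set1_ps(in[0]);
        __m128 y = _mm_set1_ps(in[1]);
        __m128 z = _mm_set1_ps(in[2]);
        __m128 w = _mm_set1_ps(in[3]);
        __m128 r = _mm_add_ps(
                       _mm_add_ps(_mm_mul_ps(c0, x), _mm_mul_ps(c1, y)),
                       _mm_add_ps(_mm_mul_ps(c2, z), _mm_mul_ps(c3, w)));
        _mm_storeu_ps(out, r);
    }
}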
Seriously Strange Execution?
This is a post in which I try to document what I have learned in SSE 101; if you want to make fun of me for having worked on a flight simulator for five years without writing SSE code*, go ahead now; I'll wait.
Okay then. The last time I looked at SIMD code was with Altivec; having come from PPC code I'm only barely getting used to this whole "little endian" thing, let alone the mishmash that is x86 assembler.
So a __m128 looks a lot like a float[4], and it's little endian, so if I do something like this:
float le[4] = { 0, 1, 2, 3 };
__m128 aa = _mm_loadu_ps(le);
then GDB tells me that aa contains 0, 1, 2, 3 in those "slots". And a memory inspector shows 0 in the lowest four bytes. So far so good.
Then I do this:
__m128 cc = _mm_shuffle_ps(aa,aa,_MM_SHUFFLE(3,3,3,1));
and I get 1,3,3,3 in rising memory in cc.
Wha?
Well, we can actually tease that one apart.
- The _MM_SHUFFLE macro takes its parameters from high to low bits, that is, in binary 3,3,3,1 becomes 11111101 or 0xFD.
- Thus the low two bits of the mask contain the shuffle mask (01) for the low order component of my vector.
- Thus "1" is selected into the lowest component [0] of my array.
So that's actually completely logical, as long as you understand that _MM_SHUFFLE's four arguments come in as bit-value positions, which are always written "backward" on a little endian machine. Naively, I would have reversed the macro order (and there's nothing stopping a programmer from creating a "backward" shuffle macro that reads in "array component" order). While this wouldn't be an issue on a big endian machine, the order of everything would mismatch memory - it's sort of nice that component 0 sits in the low order bits. Really what we need to do is read from right to left!
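That "array component order" macro is trivial to write if you prefer it (my own addition for illustration, not an Intel macro):

// Wrapper that takes shuffle indices in array/memory order, i.e.
// result[0] = src[i0], result[1] = src[i1], and so on.
#define MM_SHUFFLE_ARRAY(i0, i1, i2, i3) _MM_SHUFFLE((i3), (i2), (i1), (i0))

// The example above becomes:
//   __m128 cc = _mm_shuffle_ps(aa, aa, MM_SHUFFLE_ARRAY(1, 3, 3, 3));
// which reads left to right the way the result lands in memory: 1, 3, 3, 3.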
So I thought I had my head around things, until I looked at the contents of %xmm0. The shuffle code gets implemented in GDB (optimizer off) like this:
movaps -0x48(%ebp),%xmm0
shufps $0xfd,-0x48(%ebp),%xmm0
movaps %xmm0,-0x28(%ebp)
If you speak x86, that's like "see spot run", but for those who don't:
- %ebp is the stack frame pointer on OS X; with the optimizer off my local __m128 variables have been given aligned storage below the frame pointer as part of the function they sit in. -0x48 is the offset for aa and -0x28 is the offset for cc.
- This is GCC disassembly, so the destination is on the right.
- SSE operations typically work as src op dst -> dst.
- So this code loads aa into %xmm0, shuffles it with itself from memory (the results stay in %xmm0), then writes %xmm0 back to cc.
When viewed in the debugger (as a 128-bit integer and in other formats), %xmm0 contains:
128i: 0000803f 00004040 00004040 00004040
4x32i: 40400000 40400000 40400000 3f800000
16x8i: 40 40 00 00 40 40 00 00 40 40 00 00 3f 80 00 00
4x32f: 3.0 3.0 3.0 1.0
The memory for cc contains this byte string:
00 00 80 3f 00 00 40 40 00 00 40 40 00 00 40 40
I spent about 15 minutes trying to understand what the hell I was looking at, but then took a step back: if a tree falls in a forest and no one can see the trunk without writing it out to memory, who cares? I wrote some code to do unpacks and low-high moves and sure enough, completely consistent behavior. If you treat an __m128 as an array of four floats, _mm_unpacklo_ps(a,b), for example, gives you { a[0], b[0], a[1], b[1] }.
So what have I learned? Well, if you look at an Intel SSE diagram like this, my conclusion is: component 0 is the same as the low bits in memory, which is the same as the first item of an array. The fact that it is drawn on the right side of the diagram is an artifact of our left-to-right way of writing place-value numbers. (I can only speculate that Intel's Israeli design team must find these diagrams even more byzantine.)
* This is because until now in the X-Plane 10 development cycle, we haven't needed it - X-Plane 10 is the first build to do a fair amount of "uniform" transform on the CPU. If anything that's a step back, because we really should be doing that kind of thing on the GPU.
Wednesday, May 11, 2011
SvnX on OS X 10.6? You Need a Key Pair
A few members of our art team use a mix of the command line and SvnX to move art asset packs around via SVN.
One minor hitch: SvnX can't log into a server that uses svn+ssh as its access method if ssh requires a manually typed password.
The work-around is to establish a private/public key pair for ssh. Once you do that, keychain will offer to store the password, and SvnX can function normally.
In theory sshkeychain should let the key chain remember plain passwords, but I couldn't get this to work on 10.6.
The keypair can be established as follows:
cd ~/.ssh
ssh-keygen -t rsa
(type desired password, accept default file name)
scp id_rsa.pub you@server.com:/home/you/.ssh/authorized_keys
(where "you" is your unix login name. authorized_keys may need a different name for different servers.)
Saturday, May 07, 2011
The Limits of 8-bit Normal Maps
It's safe to say that when one of the commenters points out something that will go wrong, it's only a matter of time before I find it myself. In this case the issue was running out of normal map precision and it was a matter of having an art asset sensitive to normal maps.
Well, our artists keep making newer and weirder art assets, and once again normal maps are problematic. In particular, when you really tighten up specular highlights, the angular precision per pixel of 8-bit normal maps makes it very difficult to create "small" effects without seeing the quantization of the map.
I made this graph to illustrate the problem:
So what is this? This equation (assuming I haven't screwed it up) shows the fall-off of specular light levels as a function of the "displacement" of the non-tangent channels of a normal map. Or more literally, for every pixel of red you add to move the normal map off to the right, how much less bright does the normal become?
In this case, our light source is hitting our surface dead on, we're in 8-bit, and I've ignored linear lighting (which would make the problems here worse in some cases, better in others). I've also ignored specularity being "cranked" to HDR levels - since we do this in X-Plane, the effects are probably 2x to 3x worse than shown. Units to the right are added dx and dy offset in the normal map, and each unit vertically is a loss of one brightness value.
Three fall-off curves are shown based on exponents ^128, ^1024, and ^4096. (The steepest and thus most sensitive one is ^4096). You can think of your specular exponent as an "amplifier" that "zooms in" on the very top of the lighting curve, and thus amplifies errors.
So to read this: for the first minimal unit of offset we add to the normal map, we lose about two minimal units of brightness. In other words, even at the top of the curve, with an exponent of ^1024, specular highlights will have a "quantized" look, and a smooth ramp of color is not possible. It gets a lot worse - add that second unit of offset to the normal map and we lose eight units of color!
(By comparison, the more gentle ^128 specular highlight isn't as bad - we lose six units of brightness for five of offset, so subtle normal maps might not look too chewed up.)
This configuration could be worse - at least we have higher precision near zero offset. With tangent space normal maps, large areas of near constant normals tend to be not perturbed very much - because if there is a sustained area of the perceived surface being "perpendicular" to the actual 3-d mesh, the author probably should have built the feature in 3-d. (At least, this is true for an engine like X-Plane that doesn't have displacement mapping.)
What can we do?
- Use some form of normal map compression that takes advantage of the full bit-space of RGB.
- Throw more bits at the problem, e.g. use RG16. (This isn't much fun if you're using the A and B channels for other effects that only need 8 bits.)
- Use the blue channel as an exponent (effectively turning the normal map into some kind of freaky 8.8 floating point). This is something we're looking at now, so I'll have to post back as to whether it helps. The idea is that we can "recycle" the dynamic range of the RG channels when close to dark using the blue channel as a scalar. This does not provide good normal accuracy for highly perturbed normals; the assumption above is that really good precision is needed most with the least offset.
- Put some kind of global gamma curve on the RG channels. This would intentionally make highly perturbed normals worse to get better results at low perturbations. (I think we're unlikely to productize this, but it might in some cases provide better results using only 16 bits.)
- Tell our artists "don't do that". (They never like hearing that answer.)
Sunday, May 01, 2011
Damn You, L2 Cache!!!
So first: this is a good read. Having spent the weekend reading about how important it is not to miss cache and being reminded that having your structs fit in cache lines makes you bad-ass, I was all prepared to score an epic win against the forces of garbage collection.
Let me take a step back.
X-Plane uses a series of custom allocation strategies in places where we know things that the system allocator cannot know (e.g. "all of these blocks have the same life-span", or "these allocations don't have to be thread-safe"), and this got us a win in terms of less CPU time being spent allocating.
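By way of illustration, the simplest of these strategies is a bump-pointer arena for blocks that all die together - a sketch, not X-Plane's actual allocator:

// Sketch: a single-threaded arena for allocations that share one life-span.
// Allocation is a pointer bump with no locking; everything is freed at once
// when the arena is destroyed.
#include <cstddef>
#include <cstdlib>
#include <vector>

class Arena {
public:
    explicit Arena(size_t block_bytes = 64 * 1024)
        : block_bytes_(block_bytes), cur_(0), remaining_(0) {}
    ~Arena() { for (size_t i = 0; i < blocks_.size(); ++i) free(blocks_[i]); }

    void* alloc(size_t n)
    {
        n = (n + 15) & ~size_t(15);                    // keep 16-byte alignment
        if (n > remaining_) {                          // start a new block
            size_t sz = n > block_bytes_ ? n : block_bytes_;
            cur_ = static_cast<char*>(malloc(sz));
            blocks_.push_back(cur_);
            remaining_ = sz;
        }
        void* p = cur_;
        cur_ += n;
        remaining_ -= n;
        return p;
    }
private:
    size_t              block_bytes_;
    char*               cur_;
    size_t              remaining_;
    std::vector<char*>  blocks_;
};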
X-Plane also uses a quad-tree-like structure to cull our scene-graph. The cull operation is very fast, and so (not surprisingly) when you profile the scene graph, the 'hot spots' in the quad tree are all L2 cache misses. (You can try this on X-Plane 9, just turn down objects to clear out driver time and see in-sim work.) In other words, the limiting factor on plowing through the scene graph is not CPU processing, rather it's keeping the CPU fed with more quad-tree nodes from memory.
The nodes in the quad tree come from one of these custom allocation strategies.
So my clever plan was: modify the custom allocator to try to keep quad-tree nodes together in memory, improving locality, improving cache hits, improving framerate, and proving once again that all of that misery I go through managing my own memory (and chasing down memory scribbles of my own creation) is totally worth it!
Unfortunately, my clever plan made things worse.
It turns out that the allocation pattern I had before was actually better than the one I very carefully planned out. The central problem with most parts of X-Plane's scene graph is: you don't know what "stuff" is going to come out of the user's installed custom scenery, and you don't know precisely what will and won't be drawn. Thus while there is some optimal way to set up the scene graph, you can't precompute it, and you can only come close with heuristics.
In this case the heuristic I had before (allocation order will be similar to draw order) turns out to be surprisingly good, and the allocation order I replaced it with (keep small bits of the scene graph separate so they can remain local within themselves later) was worse.
So...until next time, L2 cache, just know that somewhere, deep in my underground lair, I will be plotting to stuff you with good data. (Until then, I may have to go stuff myself with scotch.)