Wednesday, October 31, 2012

Non-Square OGSSAA+FXAA

I just finished integrating FXAA 3.11 into X-Plane 10; we were using an older FXAA implementation.  Timothy Lottes pointed me at the right way to combine OGSSAA and FXAA: run the FXAA sampler in SSAA space and mix down its results on the fly.  Conveniently our pipeline was pretty much ready to do this.

I've been looking for an option for OGSSAA other than 4x.  One idea would be to use 1.4x1.4 scaling for 2x total fill rate costs, blurry box filtering be damned, but FXAA really needs to know where the individual pixels are.

X-Plane's main problem is temporal aliasing, and most of that aliasing is vertical - that is, there are a lot of long, thin horizontal features in the background (roofs of buildings, roads, etc.) that are responsible for the most annoying artifacts.

So I tried an experiment: non-square OGSSAA with FXAA.  And...it pretty much works.  I'm sure someone has done this before, and frankly I don't have the greatest eye for anti-aliasing, but the extra vertical res really improves image stability.

Secret decoder ring to the images: images with "FX" in the name (or use_post_aa=2 in the caption) have FXAA applied.  The 2x/4x/8x refers to the OGSSAA sample count; the grid gives the sample layout per pixel: 2x OGSSAA is 1x2, 4x is 2x2, and 8x is 2x4.
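
For illustration, allocating the off-screen target for a non-square grid just means scaling each axis independently before the mix-down pass.  This is a sketch, not X-Plane's code; the GL_RGBA16F format and the function name are my own assumptions:

// Hypothetical: allocate a non-square supersampled color target.
// sample_x/sample_y are the per-axis OGSSAA factors: 1x2, 2x2, or 2x4.
GLuint make_ssaa_target(int screen_w, int screen_h, int sample_x, int sample_y)
{
    GLuint tex = 0;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    // The render target is simply wider/taller by the sample counts; FXAA runs
    // in this SSAA space and the resolve averages sample_x * sample_y texels
    // per screen pixel.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA16F,
                 screen_w * sample_x, screen_h * sample_y, 0,
                 GL_RGBA, GL_HALF_FLOAT, NULL);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    return tex;
}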

The pictures really don't do justice to the improvement that 2x4 gives the image in terms of temporal stability.  Having 4x the vertical samples for those thin roofs makes a big difference.

Tuesday, July 24, 2012

Deferred Depth: 3 Ways

I have been updating X-Plane 10's deferred pipeline; when 10.10 is in a later beta and we're sure that the new pipeline is going to hold, I'll write it up in more detail.  But for now, a few notes on deferred depth.

Our art guys need more properties in the G-Buffer, so I went looking to see if I could recycle the depth buffer, rather than waste a G-Buffer channel on depth.

The Problem

The problem is this: a deferred renderer typically wants to read depth to reconstruct eye-space position in the lighting pass; position is needed for lighting attenuation and shadows.

But the lighting pass almost certainly also needs the depth buffer to be bound for depth rejection.  There are a number of cute G-Buffer tricks that require this:
  • Any kind of depth-based light rejection (whether with a full stencil volume or just based on the bounding volume being occluded) requires the depth buffer of the scene as a real Z buffer.
  • Soft particles require rejecting Z and sampling (to soften).  We really want the hardware Z buffer to cut fill rate!

Copying the Depth Buffer

One simple approach is to copy the depth buffer to a depth texture.  In my case, I tried copying the current depth buffer (which is D24/S8) to a GL_DEPTH_COMPONENT24 texture using glCopyTexSubImage2D.  This worked well and didn't cause performance problems; I guess after a number of years ripping the depth buffer to a texture has finally been ironed out.

With this technique the eye space layer of the G-buffer is another texture, but it comes from a single full-screen copy rather than from an additional MRT color attachment.
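
A minimal sketch of that copy (assuming a screen-sized destination texture created with GL_DEPTH_COMPONENT24; the variable names are placeholders):

// One-time setup: a screen-sized depth texture to copy into.
glBindTexture(GL_TEXTURE_2D, depth_copy_tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_DEPTH_COMPONENT24, screen_w, screen_h, 0,
             GL_DEPTH_COMPONENT, GL_UNSIGNED_INT, NULL);

// After the G-buffer fill pass, with the G-buffer FBO bound for reading,
// rip the current depth buffer into the texture:
glBindTexture(GL_TEXTURE_2D, depth_copy_tex);
glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, screen_w, screen_h);
// depth_copy_tex can now be sampled in the lighting pass while the real
// depth buffer stays bound for depth rejection.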

Read And Use

A second approach is to simply bind the depth buffer and use it at the same time.  This scheme requires GL_NV_texture_barrier (an extension I didn't know about until smarter people clued me in recently) and thus is only available on Windows.  In this scheme you:
  • Set up your D24/S8 depth buffer as a texture attached to depth in your main G-Buffer FBO, rather than a render-buffer.  Non-power-of-two textures are a given on DX10-class hardware.
  • Share this depth buffer with your HDR texture that you "accumulate" light into.
  • After completing the g-buffer fill pass, call glTextureBarrierNV() to ensure that all write-outs to the depth buffer have completed before the next thing happens.
  • Turn off depth writing and Z-test and read from the depth buffer at the same time (something that is allowed by the relaxed semantics of the extension).
This saves us the copy and extra VRAM, but assumes that our various post-processing effects don't need to write Z, an assumption that is usually true.

I have not tried this technique; see below for why sharing Z isn't for X-Plane.
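
For reference, a rough sketch of the sequence described above (untried here, as noted, and with hypothetical FBO/texture names):

// One D24/S8 texture shared as the depth attachment of both the G-buffer FBO
// and the HDR light-accumulation FBO.
glBindTexture(GL_TEXTURE_2D, depth_stencil_tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_DEPTH24_STENCIL8, screen_w, screen_h, 0,
             GL_DEPTH_STENCIL, GL_UNSIGNED_INT_24_8, NULL);

glBindFramebuffer(GL_FRAMEBUFFER, gbuffer_fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_STENCIL_ATTACHMENT,
                       GL_TEXTURE_2D, depth_stencil_tex, 0);
glBindFramebuffer(GL_FRAMEBUFFER, hdr_fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_STENCIL_ATTACHMENT,
                       GL_TEXTURE_2D, depth_stencil_tex, 0);

// ... fill the G-buffer ...

glTextureBarrierNV();       // make all depth writes visible to texturing
glDepthMask(GL_FALSE);      // no further depth writes
glEnable(GL_DEPTH_TEST);    // keep Z-reject for the lighting pass
glBindTexture(GL_TEXTURE_2D, depth_stencil_tex);   // and sample it too
// ... run the lighting pass ...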

Eye-Space Float Z

One simple way to solve the problem (the one X-Plane originally used, and one that is sometimes used on older platforms that won't let you read and sample Z at the same time) is to simply write eye-space Z to part of the G-Buffer.  This wastes G-Buffer space, but...

...it is unfortunately necessary for X-Plane.  X-Plane draws in two depth domains, one for the world and one for the 3-d cockpit.  Thus no one Z buffer contains full position information for the entire screen.  In order to avoid two full mix-downs of the G-Buffer, we simply write out eye-space position in floating point, which gives us adequate precision over the entire world.*

If I could find a depth encoding that clips properly, doesn't inhibit early Z, and can span the entire depth range we need, we could use one of the techniques above, but I don't think such a Z technique exists, as X-Plane needs a near clip plane of around 1-5 cm in the cockpit and at least 100k meters to the far clip plane.  With D24S8 we're off by quite a few bits.

(I have not had a chance to experiment with keeping the stencil buffer separate yet.)

*Currently the code uses 16-bit float eye space depth, which is apparently faster to fill than 32F on ATI hardware according to some presentation I found.  Is it enough precision?  I am not sure because I will have to fix other shadow bugs first.  But it should be noted that we care a lot about near precision but really not much about far depth, which is only used for fog.  If a later post says we use a 32F eye-space Z and not 16F, you'll know what happened.

Wednesday, April 04, 2012

Beyond glMapBuffer

For a while X-Plane has had a performance problem pushing streaming vertices through ATI Radeon HD GPUs on Windows (but not OS X).  Our initial investigation showed that glMapBuffer was taking noticeable amounts of time, which is not the case on other platforms.  This post explains what we found and the work-around that we are using.

(Huge thanks to ATI for their help in giving me a number of clues!  What we did in hindsight seems obvious, but with 536 OpenGL extensions to choose from we might never have found the right solution.)

What Is Streaming?

To be clear on terminology, when I say streaming, I mean a vertex stream going to the GPU that more-or-less doesn't repeat, and is generated by the CPU per frame.  We have a few cases of these in X-Plane: rain drops on the windshield, car headlights (since the cars move per frame, we have to respecify the billboards every frame) and the cloud index buffers all fit into this category.  (In the case of the clouds, the Z sort is per frame, since it is affected by camera orientation; the puff locations are static.)

In all of these cases, we use an "orphan-and-map" strategy: for each frame we first do a glBufferData with a NULL ptr, which effectively invalidates the contents of the buffer at our current time stamp in the command stream; we then do a glMapBuffer with the write flag.  The result of this is to tell the driver that we want memory now and we don't care what's in it - we're going to respecify it anyway.
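
In code, the orphan-and-map idiom is roughly this (a sketch, not X-Plane's actual streaming wrapper; names are placeholders):

// Per-frame streaming write using orphan-and-map.
glBindBuffer(GL_ARRAY_BUFFER, stream_vbo);

// Orphan: same size, NULL pointer.  The old contents are abandoned at this
// point in the command stream; the driver can hand back a fresh buffer if the
// previous one is still in flight.
glBufferData(GL_ARRAY_BUFFER, buffer_size, NULL, GL_STREAM_DRAW);

// Map and respecify the whole thing.
void * dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
memcpy(dst, this_frames_vertices, buffer_size);
glUnmapBuffer(GL_ARRAY_BUFFER);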

Most drivers will, in response to this, prepare a new buffer if the current one is still in-flight.  The effect is something like a FIFO, with the depth decided by the driver.  We like this because we don't actually know how deep the FIFO should be - it depends on how far behind the GPU is and how many GPUs are in our SLI/CrossFire cluster.

Why Would Streaming Be Expensive?

If orphaning goes well, it is non-blocking for the CPU - we keep getting new fresh buffers from the driver and can thus get as far ahead of the GPU as the API will let us.  So it's easy to forget that what's going on under the hood has the potential to be rather expensive.*  Possible sources of expense:
  • If we really need a new buffer, that is a memory allocation, by definition not cheap (at least for some values of the word "cheap").
  • If the memory is new, it may need a VM operation to set its caching mode to uncached/write-combined.
  • If the driver doesn't want to just spit out new memory all of the time it has to check the command stream to see if the old memory is unlocked.
  • If the driver wants to be light on address space use, it may have unmapped the buffer, so now we have a VM operation to map the buffer into our address space.  (From what I can tell, Windows OpenGL drivers are often quite aggressive about staying out of the address space, which is great for 32-bit games.  By comparison, address space use with the same rendering load appears to always be higher on Mac and Linux.  64 bit, here we come!)
  • If orphaning is happening, at some point the driver has to go 'garbage collect' the orphaned buffers that are now floating around the system.
Stepping back, I think we can say this: using orphaning is asking the driver to implement a dynamic FIFO for you.  That's not cheap.

Real Performance Numbers

I went to look at the performance of our clouds with the following test conditions:
  • 100k particles, which means 400k vertices or 800k of index data (we use 16-bit indices).
  • 50 batches, so about 2000 particles per VBO/draw call.
  • The rest of the scenery system was set quite light to isolate clouds, and I shrunk the particle size to almost nothing to remove fill rate from the calculation.
(In this particular case, we need to break the particles into batches for both Z sorting and culling reasons, and the VBOs aren't shared in a segment-buffer-like scheme due to the scrolling of scenery.)
 
Under these conditions, on my i5-2500 I saw these numbers.  The 0 ms really means < 1 ms, as my timer output is only good +/- 1 ms.  (Yes, that sucks...the error is in the UI of the timer, not the timer itself.)
  • NV GTX 580, 296.xx drivers: 2 ms to sort, 0 ms for map-and-write, 0 ms for draw.
  • ATI Radeon 7970 12-3 drivers: 2 ms to sort, 6 ms to map, 1 ms to write, 1 ms to plot.
That's a pretty huge difference in performance!  The map+write and draw is basically free on NV hardware, but costs 8 ms on ATI hardware.

glMapBufferRange Takes a Lock

In my original definition of streaming I described the traditional glBufferData(NULL) + glMapBuffer approach.  But glMapBufferRange provides more explicit semantics - you can pass the GL_MAP_INVALIDATE_BUFFER_BIT flag to request a discard-and-map in a single call, without the separate glBufferData(NULL).  What surprised me is that on ATI hardware this performed significantly worse!
 
It turns out that as of this writing, on ATI hardware you also have to pass GL_MAP_UNSYNCHRONIZED_BIT or the map call will block waiting on pending draw calls.  The more backed up your GPU is, the worse the wait; while the 6 ms of map time above is enough to care a lot, blocking on a buffer can cut your framerate in half.
 
I believe that this behavior is unnecessarily conservative - since I don't see buffer corruption with unsynchronized + invalidation, I have to assume that they are mapping new memory, and if  that is the case, the driver should always return immediately to improve throughput.  But I am not a driver writer and there's probably fine print I don't understand.  There's certainly no harm in passing the unsynchronized bit.

With invalidate + unsynchronized, map-buffer-range has the same performance as bufferdata(NULL)+map-buffer.  Without the unsynchronized bit, map-buffer-range is really slow.
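
For reference, the flag combination that performed acceptably looks like this (a sketch; error handling omitted):

// Streaming write via glMapBufferRange.  On the ATI drivers discussed here,
// omitting GL_MAP_UNSYNCHRONIZED_BIT made the map call block.
glBindBuffer(GL_ARRAY_BUFFER, stream_vbo);
void * dst = glMapBufferRange(GL_ARRAY_BUFFER, 0, buffer_size,
                              GL_MAP_WRITE_BIT |
                              GL_MAP_INVALIDATE_BUFFER_BIT |
                              GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(dst, this_frames_vertices, buffer_size);
glUnmapBuffer(GL_ARRAY_BUFFER);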

Map-Free Streaming with GL_AMD_pinned_memory

Since glMapBufferRange doesn't do any better when using it for orphaning, I tried a different path: GL_AMD_pinned_memory, an extension I didn't even know existed until recently.

The pinned memory extension converts a chunk of your memory into an AGP-style VBO - that is, a VBO that is physically resident in system memory, locked down, write-combined, and mapped through the GART so the GPU can see it.  (In some ways this is a lot like the old-school VAR extensions, except that the mapping is tied to a VBO so you can have several in flight at once.)  The short of it:
  • Your memory becomes the VBO storage.  Your memory can't be deallocated for the life of the VBO.
  • The driver locks down your memory and sets it to be uncached/write-combined.  I found that if my memory was not page-aligned, I got some really spectacular failures, including but not limited to crashing the display driver.  I am at peace with this - better to crash hard and not lose fps when the code is debugged!
  • Because your memory is the VBO, there is no need to map it - you know its base address and you can just write to it.
This has a few implications compared to a traditional orphan & map approach:
  1. In return for saving map time, we are losing any address space management.  The pinned buffer will sit in physical memory and virtual address space forever.  So this is good for streaming amounts of data, but on a large scale might be unusable.  For "used a lot, mapped occasionally" this would be worse than mapping.  (But then, in that case, map performance is probably not important.)
  2. Because we never map, there is no synchronization.  We need to be sure that we are not rewriting the buffer for the next frame while the previous one is in flight.  This is the same semantics as using map-unsynchronized.
On this second point, my current implementation does the cheap and stupid thing: it allocates 4 segments of the pinned buffer, allowing us to queue up to 4 frames of data; theoretically we should be using a sync object to block in the case of "overrun".  The problem here is that we really never want to hit that sync - doing so indicates we don't have enough ring buffer.  But we don't want to allocate enough FIFO depth to cover the worst CrossFire or SLI case (where the outstanding number of frames can double or more) and eat that cost for all users.  Probably the best thing to do would be to fence and then add another buffer to the FIFO every time we fail the fence until we discover the real depth of the renderer.
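
A sketch of the setup, assuming the buffer target is GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD per the extension spec (the 4-segment FIFO and the variable names are placeholders):

// The client memory must stay allocated (and, per the crash note above,
// page-aligned) for the life of the VBO.
const size_t frames_in_flight = 4;              // cheap-and-stupid FIFO depth
const size_t total_size = frames_in_flight * buffer_size;

void * pinned = NULL;
posix_memalign(&pinned, 4096, total_size);      // _aligned_malloc on Windows

GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD, vbo);
glBufferData(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD, total_size, pinned,
             GL_STREAM_DRAW);
glBindBuffer(GL_EXTERNAL_VIRTUAL_MEMORY_BUFFER_AMD, 0);

// Per frame: no map call at all - just write into this frame's segment of our
// own memory, then draw with the VBO bound as GL_ARRAY_BUFFER using an offset
// of (frame_index % frames_in_flight) * buffer_size.
char * dst = (char *) pinned + (frame_index % frames_in_flight) * buffer_size;
memcpy(dst, this_frames_vertices, buffer_size);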
 
With pinned memory, mapping is free, the draw costs about 1 ms and the sort costs us 2 ms; a savings of about 6-7 ms!

Copying the Buffer

The other half of this is that we use glCopyBufferSubData to "blit" our pinned buffer into a static-draw VRAM-based VBO for the actual draw.  Technically we never know where any VBO lives, but with pinned memory we can assume it's an AGP buffer; this means that we eat bus bandwidth every time we draw from it.

glCopyBufferSubData is an "in-band" copy, meaning that the copy happens when the GPU gets to it, not immediately upon making the API call.  This means that it won't block just because a previous draw call that uses the destination buffer hasn't completed.
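
A minimal sketch of that copy, assuming a separate static-draw VBO as the destination:

// Copy from the pinned (AGP) buffer into a VRAM-resident VBO.
glBindBuffer(GL_COPY_READ_BUFFER, pinned_vbo);
glBindBuffer(GL_COPY_WRITE_BUFFER, static_vbo);
glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER,
                    src_offset, 0, buffer_size);
// The copy is queued in-band; subsequent draws source static_vbo from VRAM.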

In practice, for our clouds, I saw better performance without the copy - that is, drawing vertices from AGP was quicker than copying to VRAM.  This isn't super-surprising, as the geometry gets used only twice, and it is used in a very fill-rate expensive way (drawing alpha particles).**  We lost about 0.5 ms by copying to VRAM.

Sanity Checks

Having improved cloud performance, I then went to look at our streaming light billboard and streaming spill volume code and found that this code was mistuned; the batch size was set low for some old day when we had fewer lights.  Now that our artists have had time to go nuts on the lighting engine, we were doing 5000 maps/second due to poor bucketing.

For that matter, the total amount of data being pushed in the stream was also really huge.  If there's a moral to this part of the story it is: sometimes the best way to make a slow API fast is to not use it.

Better Than Map-Discard

Last night I read this Nvidia presentation from GDC 2012, and it surprised me a little; this whole exercise had been about avoiding map-discard on ATI hardware for performance reasons - on NVidia hardware the driver was plenty fast.  But one of the main ideas of the paper is that you can do better than map-discard by creating your own larger ring buffer and using a sub-window.  For OpenGL I believe you'd use the unsynchronized, discard-range, and write flags, mapping each successive window as you fill it.

The win here is that the GPU doesn't actually have to manage more than one buffer; they can do optimal things like just leave the buffer mapped for a while or return scratch memory and DMA it into the buffer later.  This idiom is still a map/unmap though, so if the driver doesn't have special optimization to make map fast, it wouldn't be a win.

(That is, on ATI hardware I suspect that ring-buffering your pinned VBO is better than using this technique.  But I must admit that I did not try implementing a ring buffer with map-discard-range.)

The big advantage of using an unsynchronized (well, synchronized only by the app) ring buffer is that you can allocate arbitrary size batches into it.  We haven't moved to that model in X-Plane because most of our streaming cases come in large and predictable quantities.
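
A sketch of that ring-buffer idiom as I understand it from the presentation (the window bookkeeping here is hypothetical):

// Ring buffer inside one large VBO: map only the next window, unsynchronized,
// and let the app (not the driver) manage reuse.
static size_t cursor = 0;

glBindBuffer(GL_ARRAY_BUFFER, ring_vbo);
if (cursor + batch_size > ring_size)
    cursor = 0;           // wrap; a fence should confirm this window is free

void * dst = glMapBufferRange(GL_ARRAY_BUFFER, cursor, batch_size,
                              GL_MAP_WRITE_BIT |
                              GL_MAP_INVALIDATE_RANGE_BIT |
                              GL_MAP_UNSYNCHRONIZED_BIT);
memcpy(dst, batch_vertices, batch_size);
glUnmapBuffer(GL_ARRAY_BUFFER);

// Draw using 'cursor' as the byte offset into ring_vbo, then advance.
cursor += batch_size;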

* In all of these posts, I am not a driver writer and have no idea what the driver is really doing.  But on the one or two times I have seen real live production OpenGL driver code, I have been shocked by how much crud the driver has to get through to do what seems like a cheap API call.  It's almost like the GL API is a really high level of abstraction!  Anyway, the point of this speculation is to put into perspective why, when confronted with a slow API call, the right response might be to stop making the call, rather than to shout all over the internet that the driver writers are lazy.  They're not.

** When drawing from AGP memory, the rate of the vertex shader's advancing through the draw call will be limited by the slowest of how fast it can pull vertex data over the AGP bus and how fast it can push finished triangles into the setup engine.  It is reasonable to expect that for large numbers of fill-heavy particles, the vertex shader is going to block waiting for output space while the shading side is slow.  Is the bus idle at this point?  That depends on whether the driver is clever enough to schedule some other copy operation at the same time, but I suspect that often that bus bandwidth is ours to use.

Monday, February 27, 2012

Confessions of a Lisp Hater

I hate to admit this, but sometimes I find myself missing python. For example, the other day, I wrote this horrible mess^H^H^H^H^H^Hbeautiful example of templated C++:
CollectionVisitor<Pmwx,Face_handle,UnlinkedFace_p> face_collector(&friends, UnlinkedFace_p(&io_faces));
VisitAdjacentFaces<Pmwx, CollectionVisitor<Pmwx,Face_handle,UnlinkedFace_p> >(*f, face_collector);
Once you start writing really well-factored C++ template functions, eventually you'll reach this point: the verbosity of code is dominated by having to write a ton of function objects because C++ doesn't have closures. (Yes, closures. No one is more pained to admit envy of a Lisp feature than me.)

What makes the above code particularly ugly isn't just, um, that mess, but the other shoe - the UnlinkedFace_p predicate is several lines of boiler plate for about one expression of useful C++:
struct UnlinkedFace_p {
    set<Face_handle> * universe_;
    UnlinkedFace_p(set<Face_handle> * universe) : universe_(universe) { }
    bool operator()(Face_handle who) const {
        return universe_->count(who) > 0;
    }
};
(Note: I am trying to get the less than and greater than symbols manually converted so Blogger doesn't delete 3/4 of my templated code like it has in the past.)

What makes this function so verbose is that the "closure" - that is, the act of copying the locally scoped data "into" the predicate object - has to be done by hand for every predicate we write, and the predicate can't be located in the function where it really belongs. (At least, if I try to locate my struct inline, GCC becomes quite cross - C++03 doesn't allow local types as template arguments. There may be fine print to get me closer to predicates inline with the code.)
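
For what it's worth, C++11 lambdas (which we weren't using at the time) are exactly this closure machinery generated by the compiler; as an illustration only, the predicate above collapses to something like:

// Hypothetical C++11 rewrite: the capture list [&io_faces] is the "closure"
// that the hand-written struct does by copying the pointer in its constructor.
auto unlinked_face = [&io_faces](Face_handle who) {
    return io_faces.count(who) > 0;
};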

In Python this would be a lot more pleasant. There's no typing, so strip out all of that type-declaring goo, and we have real closures, so the entire function chain can be written as one nested set of calls that contain pretty much only the good bits.

This got me thinking about the difference between coding in C and Python. Here comes another stupid coding quote:
Coding in C is like going out to dinner with someone who's really cheap and insists on discussing the price of everything on the menu. "You know, that pointer dereference includes a memory access - 12 to 200 cycles. That ? involves conditional code, you might mis-predict the branch. That function is virtual - you're going to have two dependent memory fetches." You know the cost of everything on the menu. 
Coding in Python is like going on a shopping spree with your friend's credit card. "Go ahead, iterate through a list of lists, and insert in the middle. If it feels right, just do it. No, it's not expensive, why do you ask?"

Friday, February 17, 2012

XCode, Lion, GCC Oh My!!

My curiosity always pushes me to upgrade to the latest OS despite the caution issued to me by Ben. I upgraded to Lion and quickly found that XCode 3.2.6 cannot be installed on Lion. We're not ready to move completely to XCode 4 just yet, so I had to find a workaround. Luckily, there is one.

Basically, you just have to follow these simple steps:
  • Mount the Xcode 3.2.6 DMG
  • Open Terminal
  • Enter the commands:
    export COMMAND_LINE_INSTALL=1
    open "/Volumes/Xcode and iOS SDK/Xcode and iOS SDK.mpkg"
Because I need to run XCode 4 and XCode 3.2.6 side by side, I installed XCode 3.2.6 to /Developer-3.2.6 and let XCode 4 have the default /Developer directory.

Ok that gets me a working copy of XCode 3.2.6 as well as the SDKs that I need which are 10.5 and 10.6...but I'm not out of the woods yet because the X-Plane Scenery Tools requires the 10.5 SDK. I have it so there's no problem right? WRONG! The GCC being used by the system is in /Developer and if you take a look at /Developer/SDKs you'll see 10.6 and 10.7 but no 10.5. That's because 10.5 lives in /Developer-3.2.6 but GCC won't be looking there. My solution was to create a symbolic link (the linux kind ln -s, not the Mac Alias kind) in the /Developer/SDKs folder that points to the /Developer-3.2.6/SDKs folder. The actual command was:

ln -s /Developer-3.2.6/SDKs/MacOSX10.5.sdk /Developer/SDKs/MacOSX10.5.sdk

So now with that done, GCC now has access to the right SDKs. The finder view of /Developer/SDKs should look like this:


At this point, I tried to compile the libs necessary for the X-Plane Scenery Tools again but was hit with one more snag as it was failing to find bits/c++config.h as referenced by iostream.h and a dozen other files. It turns out that the new GCC is a darwin11 version of the compiler. That's fine but if you look at /Developer/SDKs/MacOSX10.5.sdk/usr/include/c++/4.2.1/ you'll notice that there are several folders that are named with respect to the GCC version they're designed for...the ones we care about are:
  • i686-apple-darwin*
  • x86_64-apple-darwin*
Here's a screenshot of what it looks like on my system:


You'll notice that there's no version for darwin11, which is why the headers cannot be found. The solution once again is a symbolic link. You'll notice that darwin8 and darwin10 are actually just links to darwin9. All I had to do was create a symbolic link back to version 9 for the 32-bit and 64-bit paths. Here are the commands (run from inside /Developer/SDKs/MacOSX10.5.sdk/usr/include/c++/4.2.1/):

ln -s i686-apple-darwin9 i686-apple-darwin11
ln -s x86_64-apple-darwin9 x86_64-apple-darwin11

After these steps, the universe seemed to be back in order again. XCode 4 and XCode 3.2.6 co-exist properly and the scenery tools compile using the newer version of GCC even using the older 10.5 SDK.

Tuesday, December 13, 2011

Stencil Optimization for Deferred Lights Without Depth Clamp

Using two sided stencil volumes to improve fill rate with deferred lights is not new; I'll write more if anyone wants, but this is all stuff I got off of the interwebs.  The high level summary:
  • To save fill rate when drawing deferred lights, we want to draw a geometric shape to the screen that covers as few pixels as possible - preferably only the ones that will be lit.
  • Typically this is done using either a billboard or a bounding volume around the light.  X-Plane 10 uses this second option, using a cube for omnidirectional lights and a quad pyramid for directional lights. (This is a trade-off of bounding volume accuracy for vertex count.)
  • If we have manifold bounding volumes, we can select only the fragments inside the volumes using a standard two-sided stenciling trick: we set the back face stencil mode to increment on depth fail and the front face stencil mode to decrement on depth fail - both with wrapping.  The result is that only screen-space pixels that contain geometry inside the volume (causing a depth fail on the back face but not the front face) end up with a non-zero stencil count.
  • Once we have our stencil buffer, we can simply render our manifold volumes with stencil test to discard fragments when our more expensive lighting shader is bound.  (A GL sketch of both passes follows this list.)
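
A sketch of the two passes in GL terms - this is the textbook setup off the interwebs, not necessarily X-Plane's exact state management:

// Pass 1: stencil-mark pixels whose geometry is inside the light volume.
glEnable(GL_DEPTH_TEST);
glEnable(GL_STENCIL_TEST);
glStencilFunc(GL_ALWAYS, 0, 0xFF);
glDepthMask(GL_FALSE);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDisable(GL_CULL_FACE);                       // both faces in one pass
// Back faces: increment (with wrap) when the depth test fails.
glStencilOpSeparate(GL_BACK,  GL_KEEP, GL_INCR_WRAP, GL_KEEP);
// Front faces: decrement (with wrap) when the depth test fails.
glStencilOpSeparate(GL_FRONT, GL_KEEP, GL_DECR_WRAP, GL_KEEP);
draw_light_volume();

// Pass 2: shade only the marked pixels with the expensive lighting shader,
// rasterizing the back faces only (see below for why).
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glEnable(GL_CULL_FACE);
glCullFace(GL_FRONT);
glDisable(GL_DEPTH_TEST);
glStencilFunc(GL_NOTEQUAL, 0, 0xFF);
glStencilOp(GL_KEEP, GL_KEEP, GL_ZERO);        // also clears stencil for the next light
draw_light_volume();
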
So far, all standard.  Now what happens when the near and far clip planes interfere with our bounding volume?
  • If the front of the volume intersects the near clip plane, that's no problem - the front facing geometry isn't drawn, but since there was no geometry in front of our light volume (how could there be - it would also be on the wrong side of the near clip plane too) this is okay.
  • We need to render the back face only of our volume to correctly rasterize the entire light.  If we rasterize the front, we'll draw nothing when the camera is inside the light volume, which is bad.  (This need to handle being inside the shadow volume gracefully is why Carmack's Reverse is useful.)
If the back of the volume intersects the far clip plane, we have a bunch of problems though.
  • When drawing the actual light volume, we're going to lose a bunch of our screen-space coverage, and the light will be missing.
  • When we're using stenciling, the increment/decrement pattern will be broken.  If we have geometry in front of the entire light, it will end up off-by-one in its surface count.  This in turn can interfere with other lights that cover the same screen space.
This last case shows up as a really weird looking bug in X-Plane: when the landing light is on and pokes out the far clip plane, we can get a cut-out that removes other area lights that cover the screen-space intersection of the landing light volume and the far clip plane.

The simple solution is obvious: use depth clamping (GL_ARB_depth_clamp) to clamp to the near and far clip planes instead of clipping.  But what if you don't have this extension?*





In these pictures, we are seeing only a close rendering of the P180 - the far clip plane is just off the end of the airplane.  The red cone extending from the tail is the pyramid light volume for the tail light that is shining out from the tail - it illuminates the top of the plane.

In the three pictures the far clip plane is progressively moved farther away.  The lighter colored square is the missing geometry - since the pyramid is clipped, you're seeing only the top and sides of the pyramid but not the base.  This is the area that will not be correctly stencil counted or rasterized.




Here we can see why this is a problem.  Note the vertical line where the back face is missing.  When we actually rasterize, we don't get any light spill - the result is a vertical clip in our light, visible on the top of the fuselage.





If depth clamp isn't available, one alternative is to restrict the Z position of each bounding volume vertex in clip space.  This can be done in the vertex shader with something like:
gl_Position.z = clamp(gl_Position.z, -gl_Position.w, gl_Position.w);
(For standard glFrustum matrices, W is positive for points in front of the camera, so the valid clip range for Z is -W to W.)

What's nice about this hack is that it is entirely in vertex shader, which means that we don't do anything that could inhibit the GPU's ability to do early or optimized Z culling.

The actual screen-space position of the view volume does not change.  This is because the position edit is done in clip space, and clip space is orthographic - X and Y turn into raster positions and Z into a depth position.  The "perspective" is created by dividing X and Y by W - we're free to completely whack Z without deforming the geometry as long as we are post-frustum-transform.

Well, not completely free.  There is one hitch: the actual Z test is no longer correct.  Observe these two pictures:

In the first picture, we see the correct Z intersection of the view volume with the fuselage.  (This picture is normal rendering with a close far clip plane, hence the lack of a pyramid back.)  The area of the fuselage that is not red is outside the light bounding volume, and there is therefore no need to shade it.

Now look at the second picture - this is with Z clamping in the vertex shader.  Because the Z position has been clamped pre-interpolation, the Z fragment positions of any face that partly extended outside the clip planes will be wrong!


In the picture we see this in the form of incorrect volume intersection.  Because the far end of the pyramid has been moved closer to us (to keep it inside the far clip plane) the fragments of the entire pyramid are too close to us - almost like a poor-man's polygon offset.  The result is that more of the fuselage has turned red - that is, the Z test is wrong.  The actual Z error will sometimes reject pixels and sometimes accept pixels, depending on the precise interaction of the view volume and the clip planes.

The net result is this: we can hack the Z coordinate in the vertex shader to guarantee complete one-sided rasterization of our view volume even with tight clip planes and no depth clamp, but we cannot combine this hack with a stencil test because the stencil test uses depth fail and our depth results are wrong.

Thus the production path for X-Plane is this:
  • In the "big" world we use two-sided stenciling.
  • In the "small" world if we have depth clamp we use two-sided stenciling and depth clamp.
  • In the "small" world if we don't have depth clamp we use vertex-shader clamping and skip stenciling.
* This is actually a real question for X-Plane 10 running on OS X 10.6.8; the ATI drivers don't support the extension, and in X-Plane we don't want to push out the far clip plane for the in-cockpit render.  Is there any other way?  The truth is, the motivation to keep the far clip plane close is mostly a software-organizational one - the sim could run with a farther far clip plane, but a lot of code uses the real view frustum and would have to be special-cased to maintain efficiency.

Saturday, September 03, 2011

Bezier Curve Optimization

I've been meaning to write up a few notes about how we turn OSM vector data into bezier curves for X-Plane.  The code is all open source - look in the scenery tools web code browser at NetPlacement.cpp and BezierApprox.cpp for the code.

First, a few basics:
  • OSM vector data comes as polyline data - that is, each road is a series of points connected by straight line segments.  There are a lot of points - sometimes every few meters along a road.
  • X-Plane 10 uses piece-wise bezier curves, meaning a road is a string of end-to-end bezier curves.  Each curve can be a line segment, quadratic, or cubic bezier curve, but not anything of higher degree.
  • The representation in X-Plane for the piece-wise bezier curves is a list of "tagged" points, where the tag defines whether a point is a piece-wise curve end-point or control point.  Semantically, the two end points must not be control points (must not be tagged) and we can never have more than two consecutive control points (because that would define a higher-order bezier).
  • There is no requirement that the curve be smooth - we can create a sharp corner at any non-control point, even between two bezier curves.
Both OSM and X-Plane's road system have network topology, but for the purpose of converting a polyline to a piece-wise bezier curve we care only about a single road between junctions - that is, a "curve" in topology-terms.

We divide the problem into two parts: converting the polyline to a piece-wise bezier curve and optimizing that bezier curve to reduce point count.

To build the initial beziers we take an idea from ATI's PN-Triangles paper. The basic idea from the paper is this: if we have a poly-line (or triangle mesh) approximation of a curved surface, we can estimate the tangents at the vertices by averaging the direction of all incident linear components.  With the tangents at the vertices, we can then construct a bezier surface through that tangent (because a bezier curve's tangent at its end point runs toward the next control point) and use that to "round out" the mesh.

The idea is actually a lot easier to understand for polylines, where the dimensionality is lower.  For each vertex in our road, we'll find the "average" direction of the road (the tangent line) at that vertex based on the two segments coming into that vertex.  The bezier control points adjacent to the vertex must run along that tangent line; we can then adjust the distance of the control points from the end point to control the amount of "bulge".

  • We start with a poly-line.
  • We calculate the tangents at each vertex.
  • We place bezier control points along the tangents at fixed fractional lengths.
  • The result is a smooth piece-wise approximation through every point.
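
A sketch of the tangent-and-control-point step in 2D (simplified types; the real code is in BezierApprox.cpp and NetPlacement.cpp and differs in detail):

#include <cmath>
#include <vector>
using std::vector;

struct P2 { double x, y; P2(double xx = 0, double yy = 0) : x(xx), y(yy) { } };

static P2 normalize(P2 v) { double l = sqrt(v.x*v.x + v.y*v.y); return P2(v.x/l, v.y/l); }

// The tangent at an interior vertex is the average of the directions of the
// two incident segments.  (End vertices are handled by the special cases below.)
static P2 tangent_at(const vector<P2>& pts, size_t i)
{
    P2 in  = normalize(P2(pts[i].x - pts[i-1].x, pts[i].y - pts[i-1].y));
    P2 out = normalize(P2(pts[i+1].x - pts[i].x, pts[i+1].y - pts[i].y));
    return normalize(P2(in.x + out.x, in.y + out.y));
}

// Control points sit along the tangents, at a fixed fraction of the segment
// length from each end point; 'frac' controls the amount of bulge.
static void control_points(const vector<P2>& pts, size_t i, double frac, P2& c1, P2& c2)
{
    double dx = pts[i+1].x - pts[i].x, dy = pts[i+1].y - pts[i].y;
    double len = sqrt(dx*dx + dy*dy);
    P2 t0 = tangent_at(pts, i);        // tangent at the segment's start vertex
    P2 t1 = tangent_at(pts, i + 1);    // tangent at the segment's end vertex
    c1 = P2(pts[i].x   + t0.x * frac * len, pts[i].y   + t0.y * frac * len);
    c2 = P2(pts[i+1].x - t1.x * frac * len, pts[i+1].y - t1.y * frac * len);
}
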
The PN-triangles approach tends to make meshes "bulge", because a curve around a convex hull always extends outward.  You can see the same look in our interpolation.  This is good for on-ramps but actually looks quite bad for a straight-away road - the road "bulges out" when the curve ends.

To address this, we categorize a road as "straight" if the road is long enough that, if we built a curve out of it, the radius of that curve would be larger than a constant value.  (We pick different constants for different types of roads.)  In other words, if two highway segments are each 1 km long and they meet at a 3 degree angle, we do not assume it is part of an arc with a 19 km radius - we assume that most of the 1 km road is straight, with a small curve (of much smaller radius) at the end.  For any given point, we can decide whether either one or both of the two adjoining line segments is "straight" or should be entirely curved.  We then form five cases:
  1. If two segments come together at a very sharp angle, we simply keep the sharp angle.  We assume that if the data had this sharp angle in the original vector data (which is quite highly noded) then there really is some kind of sharp corner.
  2. If the two segments come together at a very shallow angle, we simply keep the corner, because who cares.  This case matters when we have a very tiny angle (e.g. 0.25 degrees) but very long line segments, such that removing the tiny angle would cause significant change in the vector position due to the long "arm" and not the steep angle.  We trust that for our app the tiny curve isn't going to be visible.
  3. If the two segments are both curved, we use the PN-triangle-style tangents as usual.
  4. If one of the segments is curved and one is straight, the tangent at the point comes from the straight curve.  This takes the "bulge" out of the beginning and ending of our curves by ensuring that the curve ends by heading "into" the straight-away.
  5. If both segments are straight, we need to round the corner on the inside.  We do this by pulling back the corner along both curves and using the original point as a quadratic bezier control point.
One note about cases 2 and 5: if you pull back a vertex proportional to the angle of the vertex, then very small angles can result in very small pull-backs, resulting in a very short curve.  In the case of X-Plane, we want to ensure that there is a minimum distance between points (so that rounding errors in the DSF encode don't give us colocated points); this means that the minimum angle for case 2 has to be large enough to prevent a "tiny curve" in case 5.
The five curve cases, illustrated.
With these five curve cases we get pretty good looking curved roads.  But our point count gets out of control - at a minimum we've kept every original point, and on top of that we've added one or two bezier control points per segment.

What we need to do is generalize our curves.  Again, the PN-triangles observation can help us.  If we want to replace two piece-wise bezier curves with a single one, we know this: the tangents at the ends of the curve can't change.  This means that the two control points of the approximating curve must be collinear with the original curve's end points and their adjacent control points.

So what?  Well, if we can only move the control points "in and out" then there are really only two scalar variables for all possible approximations: how much to scale the control handles at the start and end.  And that's something we can check with brute force!

Below is the basic step to approximating a piece-wise bezier curve with two pieces as a single cubic bezier.
  • We start with a piece-wise bezier with nodes and control points.
  • For each range of curves that we will simplify, we "push" the outermost control points along the tangent vector by an arbitrary scaling factor.
  • The resulting curve will be close, but not quite the same as the original.
  • The approximation also looks "reasonable" on its own - that is, the approximations tend to have good curvature characteristics.

To find the approximate curve, we simply "search" a whole range of scalar values by trying them and measuring curve error.  In the scenery tools code, we do a two-step search, refining the scalars around the values of least error.  The initial values are picked experimentally; it's almost certainly possible to do a better job of guessing scalar values but I haven't had time to research it more.
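
A sketch of the brute-force part - the two handle scales are the only free variables.  Bezier, make_candidate, and evaluate_error are placeholder names (not the real scenery-tools functions), and the search range and step are made up:

// Two-parameter search: scale the start/end control handles of a candidate
// cubic and keep the pair of scales with the least measured error.
double best_err = 1e30;
double best_s0 = 1.0, best_s1 = 1.0;

for (double s0 = 0.5; s0 <= 3.0; s0 += 0.25)
for (double s1 = 0.5; s1 <= 3.0; s1 += 0.25)
{
    Bezier candidate = make_candidate(end_points, end_tangents, s0, s1);
    double err = evaluate_error(candidate, original_pieces);
    if (err < best_err) { best_err = err; best_s0 = s0; best_s1 = s1; }
}
// Second step (not shown): repeat with a finer increment around best_s0/best_s1.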

To measure the error we approximate the beziers with polylines (e.g. we turn each individual bezier into a poly-line of N segments) and then compare the polylines.  The polyline comparison is the variance of the distances of each point in one polyline to the other (in other words, we treat one polyline as a point set and take the variance of the distance-to-polyline of each point).  This is similar to the Hausdorff distance with two key differences:
  • Because we are taking variance and not a minimum error, we can't use our previous minimum distance from a point to a line segment to limit our spatial searches.  (See below.)  Instead, we pick some large distance beyond which the curves are too different and we use that to limit.  For low maximum acceptable errors this gives us good performance.
  • Since the variance depends on all points and not just the worst one, we can rank multiple approximations - that is, generally better approximations score quite a bit higher.
To speed up the polyline compare (which is naively O(N^2) and dominates CPU time if implemented via a nested for-loop) we can create a spatial index for the "master" line (since we will compare many candidates to it) and search only a few of the master segments when looking for the closest one.  If we know that past a certain error we're simply broken, then we can limit our queries. For my implementation, I pick an axis (e.g. X or Y) based on the original curve shape and then subdivide the polyline into monotone sub-polylines that can be put into a coordinate-sorted map.  Then we can use lower_bound to find our segments in log(N) time.  An rtree would probably work as well.

Phew.  So we can take a piece-wise bezier and come up with the best approximation through brute force and error checking.  How do we simplify an entire road?  The answer is not Douglas-Peucker.  Instead we use a bottom-up combine (sketched after the list):
  • For every non-end node in the piece-wise curve, we build the approximation of the two adjoining bezier curves and measure its error.
  • We queue every "possible merge" by error.
  • Until the queue is empty or the lowest error is too large, we:
  • Replace the two curves in the merge by one.
  • Recalculate the two neighboring merges (since one of their source curves is now quite a bit different).  Note that we must keep the original beziers around to get accurate error metrics, so a merge of two curves that originally covered eight curves is an approximation of all eight originals, not the two previous merges.
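
A sketch of that bottom-up merge (container choices and helper names are illustrative, not lifted from NetPlacement.cpp):

// Repeatedly replace the cheapest adjacent pair of curves with one cubic.
// Each candidate merge is scored against ALL the original pieces it covers.
multimap<double, Merge> queue;                  // error -> candidate merge
for (size_t i = 1; i + 1 < nodes.size(); ++i)
    queue.insert(make_pair(score_merge(nodes, i), Merge(i)));

while (!queue.empty())
{
    multimap<double, Merge>::iterator best = queue.begin();
    if (best->first > max_acceptable_error)
        break;                                  // nothing left worth merging
    Merge m = best->second;
    queue.erase(best);
    apply_merge(nodes, m);                      // two curves become one
    // The neighboring candidates now cover a different set of originals, so
    // their errors must be recomputed against the original curves they span.
    requeue_neighbors(queue, nodes, m);
}
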
Why is this better than DP?  Well, the cost of approximating two curves is going to be O(NlogN) where N is the number of pieces in the piece-wise curve - this comes straight from the cost of error measurement.  Therefore the first run through DP, just to add one point back, is going to be O(N^2logN) because it must do a full curve comparison between the original and every pair it tries.  When the curve approximation is going to contain a reasonably large number of pieces (e.g. we might merge each bezier with its neighbor and be done) the bottom-up approach gets there fast while DP does a ton of work just to realize that it didn't need to do that work.  (DP doesn't have this issue with polylines because the comparison cost is constant per trial point.)