Friday, October 08, 2010

Why GPU Sliced Shadows Fail For Clouds

I have discovered through experimentation that NVidia's technique for self-shadowing particle volumes (found here) doesn't work well for a flight simulator cloud system. When reading a white paper, it can be hard to judge the appropriateness of an algorithm for a particular application; here's what went wrong in our case.

The Basic Algorithm

The basic algorithm is something like this:
  1. Sort the particles along a single direction that works for both the light source and the viewer. (This can require rendering front-to-back relative to the viewer at times.)
  2. Slice the particles up along this direction. For each slice, plot the particles to the screen first, then add them into the shadow texture.
  3. Composite the finished system to screen (necessary if we are going front-to-back).
The algorithm produces nice soft self-shadowing because the shadow texture is being incrementally updated as we move through the slices.
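The slicing axis is the "half-angle" between the view and light directions, which is what lets one sort order serve both. As a sketch in C++-flavored pseudocode (every helper name here is hypothetical, not NVidia's actual code):

    // Minimal sketch of the slicing loop; sign conventions for the
    // half-angle vary, so treat the branch below as illustrative.
    vec3 slice_axis;
    bool front_to_back;                        // w.r.t. the viewer
    if (dot(view_dir, light_dir) >= 0.0f) {
        slice_axis    = normalize(view_dir + light_dir);
        front_to_back = false;                 // draw back-to-front
    } else {
        slice_axis    = normalize(-view_dir + light_dir);
        front_to_back = true;                  // draw front-to-back
    }
    sort_particles_along(particles, slice_axis);

    for (int s = 0; s < num_slices; ++s) {
        // Plot: draw this slice to the color buffer, attenuated by
        // the shadow accumulated from all slices nearer the light.
        bind_render_target(color_buffer);
        bind_texture(shadow_map);
        draw_slice_from_eye(particles, s, front_to_back);

        // Update shadows: draw the same slice into the shadow map
        // from the light's point of view, accumulating opacity.
        bind_render_target(shadow_map);
        draw_slice_from_light(particles, s);
    }
    composite_to_screen(color_buffer);         // needed when front-to-back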

The algorithm does work well; for a test case with a cloud built to meet the algorithm's requirements, the shadows were soft, real-time, and quite plausible.

Performance Bottlenecks

The algorithm has two basic performance bottlenecks:
  • Like all over-drawn particle system algorithms, it is fill rate limited if we overlap too many particles.
  • Slicing requires finishing rasterization to a texture and then using that texture, so the algorithm is bound by the number of slices. (Each slice costs time in the driver rebuilding the pipeline, including the cost of changing the render target, and it can stall the GPU waiting for pending rasterization to complete, depending on how smart your driver is.)
The paper points both of these out and notes that the number of slices may have to be traded for performance.

Overdraw and Alpha

The algorithm is a little bit mismatched to a flight simulator cloud system, because a flight simulator typically uses a smaller number of more opaque cloud particles to avoid fill-rate problems. This causes trouble because the algorithm doesn't naturally limit self-shadowing; it relies on each individual particle being fairly transparent, so that when two particles are near each other, the shadow they cast on each other stays light.

So the first problem in general use is that the quality of the shadows fights with the optimization of relatively opaque particles. As soon as we move to fewer, more opaque particles (with the lost detail made up via texturing), the quality of the shadows becomes quite poor.

Slicing and Bucketing

The second problem is that for a general large-scale particle field we need some kind of bucketing, and this fights with slicing. We want to break our particles into a bucket grid for two reasons:
  • It gives us a way to rapidly cull a lot of particles.
  • The bucket grid has a traversal order that is back to front, so we only need to Z-sort within a bucket, saving a lot of sorting time.
The problem is this: we don't know the spatial relationship between the slices of different buckets, so we have to slice within a bucket, and do so for each bucket on screen (see the sketch below). So if we have 12 buckets on screen, we have 12x the number of slices.
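As a sketch of the bucketed traversal (all names here are hypothetical):

    // Buckets are visited in back-to-front grid order, so only the
    // particles within one bucket ever need a real Z-sort.
    for (Bucket * b : grid.buckets_back_to_front(camera)) {
        if (!camera.can_see(b->bounds))
            continue;                          // cull a whole bucket at once
        sort_by_depth(b->particles, camera);   // small per-bucket sort
        // With sliced shadows, each bucket needs its OWN set of slices
        // here - total slice passes = buckets * slices per bucket.
        draw_with_slices(b->particles, slices_per_bucket);
    }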

Slices are really quite expensive due to the GPU setup overhead, and even a small number of buckets means that we can't afford enough slices. NVidia recommends 32-128 slices, but with buckets, you'll be lucky to get 8 slices per bucket.

Low Slice Count = Ugly

It goes without saying that a small number of slices is going to produce less correct shadows. But there is another, more serious problem: as you rotate the camera, the slicing planes change. Nearby particles that fall into the same slice will not shadow each other, and when and how this happens is a function of how thick the slices are and which way they run.

What this means is: as we rotate the camera, some particles will suddenly stop shadowing each other as the slicing planes rotate, causing noticeable popping artifacts.

The really bad artifact comes when the sun goes from being slightly in front of us to slightly behind us. At that point the algorithm switches between back-to-front and front-to-back rendering, and the slicing plane jumps by 90 degrees almost instantly. When the number of slices is small, this produces a huge visual pop.

Summary

The algorithm fails when:
  • We have mostly opaque particles and
  • We can't afford enough slices and
  • There are external constraints (like culling) artificially "wasting" slices.
Unfortunately, that is us...so...on to other techniques.

Thursday, October 07, 2010

Alpha Blending, Let's Try Again

A while ago I posted this convoluted mess of recipes for blending back to front and front to back. I've had some time to revisit the code, and the actual formulas are simpler and more consistent than I realized; they also don't require split blending functions for the back-to-front composited case, which is nice if you want to run on, well, dinosaur hardware. (Pretty much anything from the Radeon 8500 on has split blending functions.)

Premultiplied Alpha

The goal is to composite several translucent textures together, and then composite them over our scene as if the whole scene had been drawn in order. In order to make this work, we want to use premultiplied alpha - that is, textures where the RGB color has already been made 'darker' if the alpha channel is not 1.0. In this scheme our blend function can be (1.0, 1.0 - SA) instead of the normal (SA, 1.0-SA) because the source pixel is already multiplied by SA. That would be the premultiplication.
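In OpenGL terms, that looks something like this (a sketch; pixels is a hypothetical in-memory RGBA image, and in practice the premultiply would happen off-line in the art pipeline):

    // Premultiply at load time: scale RGB down by alpha.
    for (int i = 0; i < num_pixels; ++i) {
        pixels[i].r *= pixels[i].a;
        pixels[i].g *= pixels[i].a;
        pixels[i].b *= pixels[i].a;
    }

    // (1.0, 1.0 - SA): no SA factor on the source, because the
    // texture already contains RGB * SA.
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA);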

Why is premultiplication a good idea? We have to solve the problem of "what is under translucent", and premultiplication does that. In a premultiplied texture, the RGB channel becomes more black as it becomes more transparent, and thus "nothing" has a valid color representation (black). In a traditional texture, there is color behind transparent, and that can cause sampling artifacts.

So our goal is to composite a premultiplied texture. That means that the "clear" will be 0,0,0,0 (black, transparent). Note that while the color is black (meaning nothing to add color-wise) we still need that alpha channel to be 0 (transparent) too to tell us that the background won't be occluded.
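Putting that together, the off-screen pass looks something like this sketch (the render-target helpers are hypothetical):

    // Clear the off-screen buffer to premultiplied "nothing":
    // black AND transparent.
    bind_render_target(offscreen_fbo);            // hypothetical helper
    glClearColor(0.0f, 0.0f, 0.0f, 0.0f);
    glClear(GL_COLOR_BUFFER_BIT);

    // Draw the translucent layers with premultiplied blending.
    glBlendFunc(GL_ONE, GL_ONE_MINUS_SRC_ALPHA);
    draw_translucent_layers();                    // hypothetical

    // The off-screen buffer is now itself a premultiplied image, so we
    // composite it over the scene with the very same blend function.
    bind_render_target(main_framebuffer);
    draw_fullscreen_quad(offscreen_color_texture);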

Fixing Back to Front

If you have ever blended together a bunch of geometry (back to front) and then composited the result on top of something else, you know that the alpha channel for that back-to-front geometry is going to be pretty screwed up. To see the problem, imagine blending a really light (10% alpha) screen over an already opaque scene. That light screen will (by a "strength" of 10%) move the alpha channel away from opacity and toward translucency. The problem is that the alpha blends itself, and we don't want that.

It turns out that pre-multiplied alpha can fix this. We set our blending equation to (1.0, 1.0-SA) and we pre-multiply our RGB. Our alpha will now be the old alpha (lightened by the amount the new alpha is covering it) plus the new alpha, un-lightened.

To take the case of a 10% screen over an opaque scene, the alpha will be 0.1 * 1.0 + 1.0 * (1.0 - 0.1), which gives us...1.0, which is exactly right: blending over an opaque object doesn't make it translucent.
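Here's a quick sanity check of that math as a throwaway sketch (not engine code):

    #include <cassert>
    #include <cmath>

    struct RGBA { float r, g, b, a; };

    // "Over" for premultiplied colors with blend factors (1.0, 1.0 - SA):
    // out = src + dst * (1 - src.a), applied to RGB and alpha alike.
    static RGBA over_premult(RGBA src, RGBA dst) {
        float k = 1.0f - src.a;
        return { src.r + dst.r * k, src.g + dst.g * k,
                 src.b + dst.b * k, src.a + dst.a * k };
    }

    int main() {
        RGBA scene  = { 0.5f, 0.5f, 0.5f, 1.0f };  // opaque gray scene
        RGBA screen = { 0.1f, 0.1f, 0.1f, 0.1f };  // 10% white, premultiplied
        RGBA out = over_premult(screen, scene);
        // 0.1 * 1.0 + 1.0 * (1.0 - 0.1) = 1.0: still opaque.
        assert(std::fabs(out.a - 1.0f) < 1e-6f);
        return 0;
    }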

Front to Back

For the front to back case, we still want to use pre-multiplied alpha, but we set our blend factors to (1.0-DA, 1.0). With the back-to-front case in "pre-multiplied" form, this should look very symmetric. In fact, all we're doing is changing which one is the "master" (whose alpha cuts down the other) and which one is not.

What effectively happens is:
  • The less alpha is in the buffer already, the more you get to draw (hence 1.0-DA as a factor).
  • The buffer is never reduced in color (which makes sense, since you can't darken something by drawing behind it).
  • The amount of alpha opacity you leave behind/add-in is also reduced by what is already there (you matter less if you are behind something translucent).
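In OpenGL this is a single, non-split blend function:

    // Front-to-back "under" compositing of premultiplied sources:
    // out = dst + src * (1 - DA), i.e. blend factors (1.0 - DA, 1.0).
    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE_MINUS_DST_ALPHA, GL_ONE);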

Wednesday, October 06, 2010

Premultiplication: Pros and Cons

I realized today that premultiplied alpha could fix a nasty artifact that we sometimes get in X-Plane: "tree ring".*

The bug is this: imagine you have two texels in your texture. The left one is transparent, and the right one is opaque green (a tree). What is the RGB "behind" the transparent one? Let's call it junk.

When this texture is sampled with linear filtering, the graphics card will do the wrong thing: it will blend the two texels by channel to come up with a texel sample that is a mix of green + junk in the RGB channel and a translucent alpha channel. Thus at the edges of our alpha-blended tree, we will see a 'ring' of junk leaking into the texture.

The traditional work-around (and the one we use for X-Plane) is to ensure that the RGB behind the transparent parts of the texture contains something valid that we wouldn't mind seeing, e.g. "more green". This is not an ideal work-around because Photoshop will put white in this space when alpha reaches 0%, so most artists will have to manually fix this problem over and over (and it's not an easy problem to see since the erroneous color is behind a 0% alpha pixel).

If we used pre-multiplied alpha, this would not be a problem. With premultiplied alpha, the RGB pixels are already multiplied by the alpha channel; thus the transparent pixel is by definition black (0% alpha * any RGB = 0,0,0 = black). Thus when we blend green and black we get "darker green", which is the appropriate pre-multiplied color for a linear sample at the edge of our tree. Simply put, premultiplying puts the alpha multiply before linear interpolation, which is what we want.
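To make that concrete, here is the arithmetic for a sample taken exactly halfway between the two texels (a worked example; the junk color is illustrative):

    // Conventional texture, with (say) white junk behind transparency:
    //   RGB   = 0.5*(1,1,1) + 0.5*(0,1,0) = (0.5, 1.0, 0.5)   <- junk ring
    //   alpha = 0.5*0.0     + 0.5*1.0     = 0.5

    // Premultiplied texture - the transparent texel is forced to black:
    //   RGB   = 0.5*(0,0,0) + 0.5*(0,1,0) = (0.0, 0.5, 0.0)
    //   alpha = 0.5*0.0     + 0.5*1.0     = 0.5
    // (0, 0.5, 0) at alpha 0.5 is exactly "50% green" in premultiplied
    // form - the correct edge color.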

Compression?

I can think of a possible reason to not use pre-multiplied alpha in production art assets: texture compression. If I have a solid green tree with an alpha channel, my texture compressor uses all of its "color bits" to get that green color right. But if I premultiply, those color bits are now storing both the color and the effect of alpha (the darkening). I may get some color distortion on my tree because the compressor is trying to get the pre-multiplied alpha right.

In other words, a non-premultiplied texture may compress better. Ideally I'd like my compressor to be alpha-aware, that is, optimize the color under the opaque part at the expense of what is under the transparent part.

The Rest Of the Story

Obviously we're not going to change X-Plane to premultiplication given how many art assets are already out there. But there is more to the story, too.

The * up there is that there is a second, significantly worse cause of "rings" on trees: z-buffer artifacts. The z-buffer doesn't handle translucency very well (and by that I mean it doesn't handle it at all). If our trees contain translucent edges due to linear filtering, we get Z put down over the translucent parts, and that cuts out any 3-d building or additional trees behind them. The result is "blue rings" where the sky shows through what should be a forest.

The solution is the one we use in practice: we turn off blending entirely and simply test the texels - they are in or out. We still use linear filtering, though, so that the alpha edge of our tree isn't square and jagged; this means we would still see a ring if we had bogus color underneath the transparent parts of the trees. And since in practice we almost always ship DXT-compressed textures, the compression argument against pre-multiplication holds.
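For reference, in fixed-function-era OpenGL the "test, don't blend" setup looks like this (a sketch; draw_trees is hypothetical):

    // In-or-out texels: no blending, just an alpha test.
    glDisable(GL_BLEND);
    glEnable(GL_ALPHA_TEST);
    glAlphaFunc(GL_GREATER, 0.5f);   // keep only mostly-opaque texels
    draw_trees();                    // hypothetical
    glDisable(GL_ALPHA_TEST);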