Monday, November 30, 2009

Per-Pixel Tangent Space Normal Mapping

This post assumes you know what tangent space normal maps are in general, and roughly how they work geometrically. If you try to code this, you'll hit one tricky problem: how do you convert your normal map from tangent to eye space? You need the normals in eye space to do lighting calculations.

The short answer is: a few lines of GLSL goo do the trick. Before explaining why those lines do what they do, we need to understand basis vectors.

Basis Vectors

Basically you can convert a vector from a source to a destination coordinate system like this:
new = old.x * X + old.y * Y + old.z * Z
where X, Y and Z are basis vectors - that is, X, Y, and Z are the axes of the old coordinate system expressed in the units of the new coordinate system. If you know where the old coordinate system is in terms of the new coordinate system, you can convert.

(A side note: the basis vectors form a matrix that performs this transformation, e.g.
[ X_x Y_x Z_x ]
[ X_y Y_y Z_y ]
[ X_z Y_z Z_z ]
What does X_y mean? That is the y component of the X axis of the old coordinate system. Because the coordinate systems are not aligned, the old X axis may be a "mix" of the new coordinate system's axes.)
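
To make this concrete, here's a minimal GLSL sketch of that conversion (the function and parameter names are mine, just for illustration):

vec3 to_new_coords(vec3 old_vec, vec3 X, vec3 Y, vec3 Z)
{
    // Each component of old_vec says "how much" of that old axis we have;
    // summing the scaled axes rebuilds the same vector in the new coordinate system.
    return old_vec.x * X + old_vec.y * Y + old_vec.z * Z;
    // Equivalently: mat3(X, Y, Z) * old_vec - the mat3 constructor takes its columns in order.
}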

I strongly recommend taking the time to deeply understand basis vectors. Once you understand them, OpenGL matrices start looking like a geometric shape instead of a blob of 16 random numbers.

So if we want to convert our normal map from tangent space to eye space, we need basis vectors - that is, we need to know where the S, T and N axes are in terms of eye space.

(A note on notation: typically these are known as the Tangent, Bitangent and Normal vectors of a TBN matrix. When I say "S" and "T" I mean the directions on your model that match the horizontal and vertical axes of your texture.)

We already know "N" - it's our normal vector, and we usually have that per vertex. Where do we get S & T?

The Derivative Functions

GLSL provides the derivative functions dFdx and dFdy. These functions return the change in value of an expression over one pixel on the screen, horizontally and vertically. You can pass anything into them. If you pass in a texture coordinate, you can find out how much your UV coordinates will change as you move right or up one pixel. If you pass in a location, you can find out how much your location will change per pixel. If you pass in a vector, the "derivative" is computed per component.

My understanding is that on some hardware, pixel shaders are run on blocks of 2x2 pixels - the same instruction is run 4 times for 4 sets of input data. When the shader hits a derivative function, it simply takes the difference between the values computed for neighboring pixels in the block to find the deltas.

The key points about the derivative functions: we can apply them to anything, and they take derivatives along the X and Y axes in screen space.
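
As a quick illustration (a sketch of my own, not part of the normal mapping shader itself - it assumes the same position_eye varying used below), the derivatives of the eye-space position give you two vectors that lie in the plane of the triangle, so their cross product is a per-pixel face normal:

varying vec3 position_eye;   // eye-space position, interpolated from the vertex shader

void main()
{
    vec3 dpdx = dFdx(position_eye);   // change in position one pixel to the right
    vec3 dpdy = dFdy(position_eye);   // change in position one pixel up
    // Both deltas lie in the plane of the triangle, so their cross product
    // is the (faceted) surface normal for this pixel.
    vec3 face_normal = normalize(cross(dpdx, dpdy));
    gl_FragColor = vec4(face_normal * 0.5 + 0.5, 1.0);   // visualize it
}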

Solving Parallel Equations

From our discussion of basis vectors we know that the S and T vectors basically do this:
dx = ds * Sx + dt * Tx
dy = ds * Sy + dt * Ty
dz = ds * Sz + dt * Tz
In other words, they map from UV coordinates to eye space.

Don't we need three basis vectors? Well, yes. The third term would be dn * Nx, etc. But our UV map is effectively flat - that is, its "N" coordinate is always 0. So we will drop the third basis vector for now. (This should be expected - the normal vector is by definition perpendicular to our mesh, and the UV map lies entirely on our mesh.)

So... we have three equations, each with two unknowns. For each one, if we only had two sets of known values (e.g. two sets of dx, ds, dt) we could solve for our basis vectors.

Well, we can! If we take the X derivative (dFdx) of our texture coordinates and our eye-space mesh location, we get: q.x (the change of eye-space position), st.s (the change of the S coordinate of our UV map) and st.t (the change of the T coordinate of our UV map). And we can do this twice, using dFdx and dFdy to get two separate sets of values.
vec3 q0 = dFdx(position_eye.xyz);
vec3 q1 = dFdy(position_eye.xyz);
vec2 st0 = dFdx(gl_TexCoord[0].st);
vec2 st1 = dFdy(gl_TexCoord[0].st);
Now we have a set of six known values to plug in to solve for our two unknowns:
dx = ds * Sx + dt * Tx
With the substitution (q0 and q1 are our derivatives from dFdx and dFdy, and likewise st0 and st1):
q0.x = st0.s * Sx + st0.t * Tx
q1.x = st1.s * Sx + st1.t * Tx
We can solve this to come up with:
Sx = ( q0.x * st1.t - q1.x * st0.t) / (st1.t * st0.s - st0.t * st1.s)
Tx = (-q0.x * st1.s + q1.x * st0.s) / (st1.t * st0.s - st0.t * st1.s)
The same trick can be pulled for the Y and Z axes - we'll use the same input st0 and st1 derivatives but the y or z components of our eye-space position derivatives. The resulting six values Sx Sy Sz, Tx Ty Tz are our tangent-space basis vectors, expressed in eye space.
vec3 S = ( q0 * st1.t - q1 * st0.t) / (st1.t * st0.s - st0.t * st1.s);
vec3 T = (-q0 * st1.s + q1 * st0.s) / (st1.t * st0.s - st0.t * st1.s);
Cleaning It Up

What units are our newly formed basis vectors in? Well, they're in the right units to rescale from UV map units (1 unit = one repetition of the texture) to our eye space units (whatever the hell they are). The main thing to note here is that S and T are almost certainly not unit vectors, but if we are going to convert a normal map from tangent to eye space, we need unit vectors. So we are going to normalize S and T.

Now we can make an observation: the denominator (st1.t * st0.s - st0.t * st1.s) is the same for S and T, and more importantly the same for all three axes. In other words, that divisor is really just a constant scaling term.

Well, if we are going to normalize anyway, we don't really need a constant scaling term at all. Nuke it!
vec3 S = normalize( q0 * st1.t - q1 * st0.t);
vec3 T = normalize(-q0 * st1.s + q1 * st0.s);
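
So here is the promised GLSL goo, assembled into one place. This is a sketch of a fragment shader, not production code - the uniform and varying names (normal_map, position_eye, normal_eye) are mine and your inputs will differ - but it shows the whole path from derivatives to an eye-space normal:

uniform sampler2D normal_map;   // tangent-space normal map
varying vec3 position_eye;      // eye-space position from the vertex shader
varying vec3 normal_eye;        // interpolated eye-space vertex normal

vec3 normal_map_to_eye()
{
    vec3 q0  = dFdx(position_eye);
    vec3 q1  = dFdy(position_eye);
    vec2 st0 = dFdx(gl_TexCoord[0].st);
    vec2 st1 = dFdy(gl_TexCoord[0].st);

    vec3 S = normalize( q0 * st1.t - q1 * st0.t);
    vec3 T = normalize(-q0 * st1.s + q1 * st0.s);
    vec3 N = normalize(normal_eye);

    // Unpack the tangent-space normal from 0..1 to -1..1.
    vec3 nm = texture2D(normal_map, gl_TexCoord[0].st).xyz * 2.0 - 1.0;

    // S, T and N are the tangent-space axes expressed in eye space, so they are
    // exactly the basis vectors that carry the normal map sample into eye space.
    return normalize(nm.x * S + nm.y * T + nm.z * N);
}

void main()
{
    vec3 n = normal_map_to_eye();
    gl_FragColor = vec4(n * 0.5 + 0.5, 1.0);   // visualize the eye-space normal
}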
When Does This Break Down

You may know from experience that tangent space normal mapping will fail if the author's UV map is zero or one dimensional (that is, the author mapped a triangle in the mesh to a line or point on the UV map). From this equation we can at least start to see why: our denominator was
st1.t * st0.s - st0.t * st1.s
When is this zero? It is zero when:
st1.t / st1.s = st0.t / st0.s
Which is to say: if the direction of the UV map derivative is the same whether we go up the screen or to the right on the screen, our UV map is degenerate and we're not going to get a sane answer.
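
If you want to detect this case at runtime (my own addition, not part of the derivation above), you can test that denominator before trusting the basis:

bool uv_is_degenerate(vec2 st0, vec2 st1)
{
    // The same denominator as above: if it is (nearly) zero, this triangle's
    // UV map has collapsed to a line or a point and the basis is useless.
    float det = st1.t * st0.s - st0.t * st1.s;
    return abs(det) < 1e-6;   // the threshold is arbitrary
}

When it returns true, fall back to the interpolated vertex normal rather than sampling the normal map.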

Sunday, November 29, 2009

Immediate Mode Instancing

Some definitions: instancing is drawing the same mesh over and over, but in different places/poses/transform/whatever. Instancing comes into play when, for example, you want to draw 8000 of the same runway light fixture or 5000 of the same car. Any time we have a truly huge number of, um, things, we can expect a larger number of instances than art assets.

The idea of instancing is to pay once in CPU time for the entire set of instances that share the same art assets, rather than paying once per instance. Since CPU time is often the limiting factor on rendering, this lets us get a lot closer to the geometry throughput the card is capable of.

Now the simplest way to draw a lot of stuff is something like this:
Set up all GL state first
For each instance
glPushMatrix();
glMultMatrix(...);
glDrawElements(...);
glPopMatrix();
This is instancing by matrix transforms, and it is faster than you'd think. (That is, I am surprised that the matrix transforms aren't more expensive.) But we are still paying per instance.

Before continuing, I cannot say this strongly enough: no state change inside the for loop!!! I say this because if you allow state change but try to minimize it, it can sneak up on you. I went to inventory X-Plane's objects for "inner loop state change", expecting to find about 20% of our urban-area art assets doing this. As it turns out, it's more like 80% that are doing this. We're losing 30% or more fps due to this state change.

Immediate Mode Instancing

Immediate mode instancing goes something like this:
Set up all OpenGL state
For each instance
glVertexAttrib (to set matrix transform)
glDrawElements
In conjunction with this, your vertex shader decodes the matrix from the vertex attributes it is passed in.
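
As a sketch of the idea (this is not X-Plane's actual shader - the attribute names are made up), you might pass the four columns of the per-instance matrix in four generic attributes, set once per instance with glVertexAttrib4fv, and rebuild the matrix in the vertex shader:

attribute vec4 instance_col0;   // four columns of the per-instance model matrix,
attribute vec4 instance_col1;   // set once per instance from the C side with
attribute vec4 instance_col2;   // glVertexAttrib4fv before glDrawElements
attribute vec4 instance_col3;

void main()
{
    // Rebuild the per-instance transform (the mat4 constructor takes columns).
    mat4 instance_matrix = mat4(instance_col0, instance_col1,
                                instance_col2, instance_col3);

    // Apply the per-instance transform, then the usual modelview-projection.
    gl_Position = gl_ModelViewProjectionMatrix * (instance_matrix * gl_Vertex);
}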

It turns out that this code is at least 30% faster than matrix-transform instancing on OS X. And in hindsight this shouldn't be surprising. I would expect the built in matrix stack to be uniform state, as it isn't expected to change per vertex. And I would expect the update of attribute state to be faster than the update of uniform state.

It may also be that the GL has to do legacy processing with matrices (such as computing inverse matrices) that can be avoided.

Why would you use immediate mode instancing instead of real hardware instancing? Well, if you are on an OS that, despite the availability of hardware instancing for years, doesn't provide the extension, immediate mode instancing provides a useful half-way point.

(In particular, you can pull your immediate mode instance values directly out of the instance array you would have used.)

Sunday, November 15, 2009

Threads: Who Am I?

I was curious as to what the cost is of retrieving "thread local storage" (per-thread variables). I am looking to move some variables from globals to per-thread, and I want to know how expensive access will be.

I went digging into the Linux thread library source. First: pthread_getspecific (which pulls out a specific variable) works by:
  1. Finding the thread's "identity" as a pointer to a basic control block.
  2. Thread-local storage is simply an array at the end of the control block.
  3. The specific retrieval is just an array access.
So specific key retrieval isn't going to be any worse than getting the thread itself. How does Linux do that?

In a way that I thought was quite clever: stacks are stored on a fixed granularity/page size. The calling thread has a stack pointer within its own stack. The thread control block is at the very beginning of the stack.

So all the calling code has to do is take a stack-local variable and round its address down to the thread stack granularity and there's the control block. That's pretty quick! No system calls needed.

Edit: here's an even cheaper way.

Thursday, November 12, 2009

Perspective-Correct Texturing, Q Coordinates, and GLSL

This morning I googled perspective texturing in response to a post on the Apple OpenGL mailing list, and this post came up. It's not a very clear post, so I wanted to expand on it with a few more points.

There Are Two Ways to Get "Perspective" Texturing

"Perspective" texturing basically means that the sampling of pixels is non-uniform along a line in screen space. In OpenGL (assuming you haven't hacked the heck out of your ST coordinates per pixel in a fragment shader) this always comes from a "perspective" division somewhere in the pipeline. In fact there are two ways this can happen:
  1. Interpolation of per-vertex data is done in a "perspective correct" manner. In other words, if you tip your triangle over in 3-d space, interpolation of all of its attributes should "look right". In this case, the texture coordinates are varying over the pixels at the right rate to create the perspective sampling effect.
  2. If there is a "q" coordinate attached to your texture coordinates, it will be interpolated first, then divided. This creates non-linear sampling, since we are sampling 1/x where x is varying linearly - we are not linearly interpolating between 1/x1 and 1/x2. (See the GLSL sketch after this list.)
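
Here's a tiny GLSL sketch of case 2 ('tex' is just a placeholder sampler name). The "Proj" sampler interpolates s, t and q and divides at sampling time, which is the same as dividing by the interpolated q yourself:

uniform sampler2D tex;   // placeholder sampler

void main()
{
    vec4 a = texture2DProj(tex, gl_TexCoord[0]);                    // divide done by the sampler
    vec4 b = texture2D(tex, gl_TexCoord[0].st / gl_TexCoord[0].q);  // divide done by hand
    // The difference should come out black - the two paths are equivalent.
    gl_FragColor = abs(a - b);
}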
What I didn't like about my old post is that it swizzles these two cases (perspective correct varying vs. perspective divide at texture sample time) a bit without saying what is happening when.

So this post will present four cases of perspective correct texturing and explain why we get the right results.

Perspective Correct Varying Is More Or Less Free

Unless your GPU belongs in a museum, it almost certainly interpolates per-vertex data in a perspective correct manner - on modern hardware this just happens, and you can't get rid of it.

Perspective Correct Geometry

The simplest case, and the one we care about most, is perspective correct texturing on geometry that has been foreshortened by a glFrustum projection matrix. This is a case of correct interpolation of varyings, not q coordinates.

In this case, when the GPU rasterizes a triangle, it interpolates the varying values between vertices in a "perspective correct" manner - that is, to calculate the weight of the varyings from the 3 triangle vertices, it interpolates the X, Y, and W coordinates separately, then does a divide per pixel. (If we interpolated X/W and Y/W, that is, the screen-space pixel coordinates, we would get affine sampling, which is ugly.)

This case works as long as your perspective comes from a matrix that puts some "goo" in the 'w' coordinate for perspective divide. That's like saying "this works as long as the sun hasn't gone out", which is why perspective correct texturing isn't something game developers have to worry about anymore.

Perspective Deformation in 2-D

What if you want to make something look like it is 3-d when it isn't? That's what Plane-Maker does with its instrument warping function. You can drag the 4 corners of an instrument to make a trapezoid and we calculate the "faux perspective".

The way Plane-Maker does this is by finding a perspective matrix that correctly transforms the corners to the desired positions. As mentioned in the previous article, it turns out that such a "2-d perspective" matrix can take the form:
[ X1 X2 0 X4 ]
[ Y1 Y2 0 Y4 ]
[ 0  0  1 0  ]
[ W1 W2 0 1  ]
And by some lucky bit of mathematical serendipity, that's a matrix with 8 unknowns, and we have 8 equations to plug in: the transform equations for all four corners (4 corners x 2 axes = 8 equations). Sure enough, you can solve for those 8 unknowns and you get a "2-d perspective" matrix.

What this matrix is really doing is putting the right values in the 'w' coordinate to make your vertices deform. Compare this to the usual frustum matrix, which simply puts the -Z coordinate into 'w'. The above matrix "works" because perspective simply requires something in your 'w' coordinate - it doesn't have to come from depth!

Like the geometry case above, this case textures correctly because the texture coordinates will be interpolated using that 'w' coordinate.

Explicit Q Coordinates

If you change the 'q' coordinate with glTexCoord4f or something similar, any perspective effects you get come from the divide at texture sampling time. Note that for GLSL fragment shaders, you need to use the "Proj" variant of the sampling routines.

You may also be getting additional perspective from your geometry. If you make a trapezoid, texture it with a square texture, calculate the right 'q' coordinates to get sampling without distortion, and then rotate the camera around this trapezoid, it's the 'q' coordinate that makes the texture fit the trapezoid, but it's varying interpolation that makes it look the same from all camera angles.

Projective Texturing With Perspective

There is one case where you might need the 'q' coordinate but not realize it: projective texturing with a frustum projection. Compare the matrices created by glFrustum and glOrtho. glFrustum puts your -Z (in eye coordinates) in the 'w' coordinate, while glOrtho always puts the constant 1.

If you want to "project" a texture onto a model (think of a slide projector projecting a slide onto a wall, but maybe also projecting onto a chair) one simple way to do this is to use the same kinds of matrices you might use to set up the camera. (This works because both camera and texture projection matrices map 3-d coordinate inputs into 2-d screen-space outputs.) One note: you will need to rescale the post-projection matrix from the range -1..1 (which is what the screen likes) to 0..1 (which is what texture samplers like).

If the "fourth row" of your texture projection matrix (the one passed to glTexGen with GL_Q) generates something other than "1" then your projection matrix has perspective, and the perspective comes from the divide at texture time, not the interpolation of varyings.

You can observe this in a GLSL shader by changing a texture sampler Proj call to a non-Proj call. Without the "Proj" your texture coordinates will go completely haywire if you have a perspective matrix.
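
Here's a sketch of what that looks like in a fragment shader (the names are mine, and the projector's frustum-times-view matrix, with the -1..1 to 0..1 scale and bias folded in, is assumed to be computed on the CPU and passed in as a uniform):

uniform sampler2D slide;         // the texture being "projected"
uniform mat4 projector_matrix;   // projector frustum * view, with scale/bias folded in
varying vec3 position_eye;       // eye-space position of this pixel

void main()
{
    // Build a projective texture coordinate from the eye-space position.
    vec4 tc = projector_matrix * vec4(position_eye, 1.0);

    // The "Proj" lookup divides st by q at sampling time - that divide is where
    // the perspective comes from. Swap this for texture2D(slide, tc.st) and the
    // projection goes haywire, as described above.
    gl_FragColor = texture2DProj(slide, tc);
}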

Tuesday, November 10, 2009

CGAL Performance - Work In Bulk

A performance tip when using CGAL 2D Boolean Operations: you'll get much better performance if you can perform multiple operations at once.

I wrote code that did something like this:
for each polygon
my_area.difference(polygon)

That snippet took 36 minutes to run. So I rewrote it like this:
vector<Polygon_2> all;
for each polygon
    all.push_back(polygon)
Polygon_set_2 sub_area;
sub_area.join(all.begin(), all.end());
my_area.difference(sub_area);
New running time: 3 minutes. Why the huge speedup? Well, every time you do a boolean operation between two polygons, the underlying arrangements are merged using a "sweep" algorithm, which is an efficient way to merge two big piles of lines. The sweep operation's time is O(NlogN) which is very good for this kind of thing, but N is the sum of the edges in both polygon sets.

If the area we are subtracting from is very complex, this means that for each subtraction we do an operation based on the big polygon's area. Ouch!

The second snippet wins in two ways:
  • We do only one operation against "my_area", so if my_area is complex we eat the cost of going through the area only once.
  • A "join" on a range of polygons is very fast because CGAL will divide and conquer in groups, to minimize the total number of operations. That is, a join on a range is faster than a join on each individual item.
If you run a profiler like Shark on CGAL and see "sweep_line_2" taking all the time, use techniques like the one above to cut down the total number of merges. It can make a huge difference!