Wednesday, December 22, 2010

Fun With GLSL Compilers

I've been poking at GPU ShaderAnalyzer; this Windows performance tool from ATI gives you a simple environment: you enter GLSL, hit compile, and look at the assembly that pops out. Perfect! (It also shows you execution cycles, but for my purposes, namely understanding what the compiler does for me, the assembly is key.)

Here are some things I have learned:
  • RV790 assembly is quite complex. (Thank goodness I don't have to code it myself.) ALU instructions consist of 5 scalar sub-instructions, only one of which can have transcendental opcodes. There's a bit of fine print; this and this make useful reading. One thing to note: the ALU has a number of 'small' tricks (absolute value, negation, clamping) 'for free'. Sometimes the compiler will use these tricks, sometimes not.

    Generally, if you write vectorized code (e.g. uniform work on a vec4) the scheduling will work out nicely. But the units of execution really are scalar, so it doesn't make sense to write work that isn't needed.

  • The compiler inlines pretty much everything, which is just fine by me. (I have no idea if recursion is even legal in GLSL - I'd never use it in production - but when I wrote a recursive factorial function, the compiler simply inlined 127 iterations and called it a day. Awesome! There's a sketch of this right after the list.)

  • The compiler performs a reasonable amount of constant folding, including (most importantly) multiply by zero. For example: I write an expensive albedo function, pow(gl_Color,gl_Color.aaaa), and multiply it by a light function that returns vec4(0.0). The result: the compiler nukes the entire code sequence and simply loads 0. (See the second sketch after this list.)

    (BTW, pow is expensive - since only one of the five ALU slots can run log and exp, raising each color channel to a non-constant power takes eight instruction groups! Ouch.)

  • The compiler will remove conditional code when the condition is fully known at compile time. So, for example, an if statement whose comparison comes from functions that return constants will be nuked, and its dead clause deleted. (The second sketch after this list shows this too.)

  • The compiler does not seem to do inference, at least in the one case I looked at. By inference I mean: if (max(0.6,gl_FragColor.r) > 0.3) will (ignoring NaN logic) always be true, regardless of gl_FragColor. But for the compiler to know this, it has to make an inference - that is, it has to compare the range [0.6..inf) with 0.3. My understanding is that LLVM can do this kind of thing, but when I tried it in a shader I simply got the full, expensive, conditional code. Moral of the story: use and apply your human brain. :-)
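For reference, a recursive factorial in GLSL looks about like this - a sketch of the experiment described above, not the exact test case:
// Roughly the kind of recursive factorial described above - a sketch, not
// the exact test case. Recursion isn't guaranteed to be legal GLSL at all,
// which makes the 127 inlined iterations even funnier.
float fact(float n)
{
    if (n <= 1.0)
        return 1.0;
    return n * fact(n - 1.0);
}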
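And here is a small sketch (mine, not a real X-Plane shader) where the constant-folding and dead-conditional behaviors both kick in:
// A zero light function - imagine the "lights off" variant of a shader.
float lighting()
{
    return 0.0;
}

void main()
{
    // The pow() is expensive, but it is multiplied by a compile-time zero,
    // so the compiler nukes the whole sequence and just loads 0.
    vec4 lit = pow(gl_Color, gl_Color.aaaa) * vec4(lighting());

    // The comparison is fully known at compile time, so the if and its dead
    // clause disappear from the generated assembly.
    if (lighting() > 0.5)
        lit += vec4(1.0);

    gl_FragColor = lit;
}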
Now...why do I care?

X-Plane's physical shader is based on conditional compilation - that is, for any given state vector of "tricks" we want to use, we recompile the shader with some #defines at the front which turn features on and off. The result is a large number of shaders, none of which need conditional logic in-shader. Fill rate isn't consumed by features we don't use. (This technique comes from our original use of GLSL to emulate and then improve on the fixed-function pipeline. To match fixed-function performance, we had to 'compile out' anything we didn't use, particularly for first-gen DX9 hardware, which doesn't give you conditional logic for free.)

The problem with this technique (and you can see this in the X-Plane 9 shaders) is that it doesn't scale well with code size. For version 10 we've done a lot of shader work, and hand-optimizing the conditional logic is getting more and more difficult.

My conclusion from observing the compiler is that 99% of the time, I can relax a little bit and let the compiler take care of optimizing the shaders down. In particular, if I define functions for each stage of the shader and use conditional compilation to 'simplify' the rules, then the simple cases will boil down to very few instructions. For example:
float calc_spec()
{
#if has_spec
    return pow(max(0.0,dot(eye_nrm,sun_vec)),128.0);
#else
    return 0.0;
#endif
}

void main()
{
    ....
    float s = calc_spec();
    gl_FragColor = albedo * lighting * shadow + ambient + shadow * vec4(s,s,s,0.0);
}
In this mess, our specularity function is subject to conditional removal for non-shiny materials. When we do this, not only is the actual specularity calc removed, but the compiler will figure out that 's' will always be 0.0 and nuke the MAD of shadow * specularity into the final lighting sum.

That's a trivial example, but it shows the principle of structuring the components separately and letting the compiler put the mess together.

As a final note: the compiler's optimization is not perfect; I suspect the above technique will 'leak' a few instructions in the simple cases relative to a one-off carefully hand-coded GLSL shader, and the GLSL isn't going to be quite as tight in a few cases as actually writing assembly.

But I can live with that, most of the time. We can always go back and hand-tune the performance cases that absolutely matter most, and the time saved not wrestling with the huge mess that is the conditional shader frees me up to do that hand optimization where it is most important.

Monday, December 20, 2010

Lisp Isn't a Language...

Someone sent me this fun list of programming quotes. From Alan Kay:
Lisp isn't a language, it's a building material.
And that material is dung.

Thursday, December 09, 2010

What OOP Isn't

When I took my first computer science class (I had already been programming on my own for a while) the department was going through something of a civil war. Some of the department had gotten religion in the form of object oriented programming (OOP) and were trying to thrust it on everything. They made us implement a linked list in an OOP way: every node was fully encapsulated!

(If you're wondering how this works, most list editing operations had to be stack-recursive - a node would set its 'next' to the return value of a call on its next, allowing a node to 'cut itself out'. It made it very hard for students who had never used linked lists to understand what was going on, because they had to learn recursion and linked lists at the same time. The result was something with the performance of LISP and the elegance of C++. It was horrible.)

They told us that OOP involved encapsulation, polymorphism, and inheritance; I have commented in the past on why this last one is often just a poor idea. At the time, in school, I only had enough programming experience to say that what we were being taught (all OOP, all the time, period) was a lot more difficult than what I had been doing (use an object for something big, like a game character) and was producing code that was more verbose and not particularly fast. Now that I have some software engineering experience, I think I can articulate the problem more precisely.

When talking to a new programmer who is looking at OOP and trying to figure out what it's all about, I say that the relative importance of encapsulation, polymorphism, and inheritance is approximately 90%, 10%, 0% respectively. The vast majority of the value of OOP is that it provides an idiom to keep one piece of code from becoming hopelessly intertwined with other pieces of code, and that's valuable in large software projects. It's also impossible to teach to undergraduates because they never have a chance to write enough code for it to matter.

Polymorphism is nice, but in my experience it's not as useful as encapsulation. If you have a polymorphic interface, you have an interface, which means that it's encapsulated...but there are plenty of cases where an interface is one-off and has no polymorphic properties. Maybe 90%-10% is harsh, but I think it's the encapsulation that matters. It may be that some product spaces are more polymorphic than others. WorldEditor (LR's open source scenery editor) has polymorphic hierarchies for most of its core components, while X-Plane itself has very few.

I bring this up because I'd like to advance (in a future blog post) a comparison of OOP techniques to others (for real software engineering problems), but OOP comes with a bit of baggage. The notions that OOP would make us better programmers, help us write bug-free code faster, or turn bad programmers into good programmers have all proven to be naively optimistic. (In particular, bad programmers have proven to be surprisingly resourceful at writing bad code given virtually any programming idiom.)

So I'd like to define (OOP - hype) as something like: good language support for idioms that make encapsulation and sometimes polymorphic interfaces faster to code. And that's useful to me! I could code the same thing in pure C, but it would make my repetitive stress injuries worse from more typing, so why do that?

FMTT, GLSL Edition

This is in the same vein as I Hate C -- all of its derivatives are contaminated with its brain damage.
gl_FragData[0] = vec4(tex_color.rgb * gl_Color.rgb*tex_color.a,clamp(tex_color.a + lit_color.a,0.0,1.0));
gl_FragData[1] = vec4(shiny_ao * cut_pos, cut_pos*position_eye.z/-1024.0, 0.0, cut_pos);
gl_FragData[2] = vec4(normal_eye_use.xyz*cut_pos, cut_pos);
gl_FragData[3] = vec4(lit_color.rgb + tex_color.rgb * gl_FrontLightModelProduct.sceneColor.rgb, (tex_color.a + lit_color.a,0.0,1.0));

(Look closely at that last component: the clamp() went missing, but thanks to C's sequence operator the expression still compiles - and quietly evaluates to 1.0.)

Sunday, December 05, 2010

Yet Another This-Is-Our-GBuffer-Format Post

If you read just about any presentation by a game studio (e.g. for GDC or Siggraph) on deferred rendering or deferred lighting, they'll probably discuss their G-Buffer format and how they tried to pack as much information as possible into a tiny space. X-Plane 10 will feature a deferred renderer (to back the global spill feature set). And...here is how we pack our G-Buffer.

The good news with X-Plane is that we don't have a complex legacy material system to support. If a game is well-batched, the forward renderer can associate separate shaders with batches by material, and each 'material' can thus be radically different in how it handles and transfers light. With a G-Buffer, the lighting equation must be unified, and thus everything we need to know to support some common lighting format must go into the G-Buffer. (One thing you can do is pack a material index into the G-Buffer.) Fortunately, since X-Plane doesn't have such a multi-shader beast, we only had one property to save: a shininess ratio (0-1, about 8 bits of precision needed).

What we did have to support was both a full albedo and a full RGB emissive texture; additive emissive textures have been in the sim for over a decade, and authors use them heavily. (They can be modulated by datarefs within the sim, which makes them useful for animating and modulating baked lighting effects.) X-Plane 10 also has a static shadow/baked ambient occlusion type term on some of the new scenery assets that needs to be preserved, also with 8 bits of precision.

The other challenge is that X-Plane authors use alpha translucency fairly heavily; deferred renderers don't do this very well. One solution we have for airplanes is to pull out some translucent surfaces and render them post-deferred-renderer as a forward over-draw pass. But for surfaces that can't be pulled out, we need to do the least-bad thing.

The Format

Thus we pack our G-Buffer into 16 bytes:
  1. RGBA8 albedo (with alpha)
  2. RG16F normal (Z vector is reconstructed)
  3. RG16F depth + shadow-merged-with-shininess
  4. RGBA8 emissive (with alpha)
When drawing into the G-Buffer, we use full blending for the albedo and emissive layers, but we always set the alpha for the depth/normal layers to 0.0 or 1.0 (based on the level of translucency), thus using the alpha blend as a per-render-target "kill" switch. If something is highly transparent, we keep the physical position of what is behind it (light passes through); if it is opaque enough, we overwrite the position/normal (light bounces off of it). It's not perfect, but it's the least bad thing I could come up with.
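Concretely, the fragment shader's G-Buffer write ends up looking something like this - a sketch with made-up names and an assumed opacity cutoff, not the actual X-Plane shader:
varying vec3  normal_eye;       // eye-space normal from the vertex shader
varying float depth_eye;        // eye-space depth, however the pipeline defines it
uniform float opacity_cutoff;   // hypothetical threshold for "opaque enough"

void main()
{
    vec4  albedo       = gl_Color;   // stand-in for the real albedo calc
    vec4  emissive     = vec4(0.0);  // stand-in for the real emissive calc
    float shadow_shine = 0.0;        // stand-in for the packed shadow/shine term

    // 1.0 = this fragment overwrites position/normal (opaque enough that light
    // bounces off it); 0.0 = keep what's behind it (light passes through).
    float keep = (albedo.a > opacity_cutoff) ? 1.0 : 0.0;

    gl_FragData[0] = albedo;                                    // RGBA8 albedo, real blending
    gl_FragData[1] = vec4(normal_eye.xy, 0.0, keep);            // RG16F normal, alpha = kill switch
    gl_FragData[2] = vec4(depth_eye, shadow_shine, 0.0, keep);  // RG16F depth + shadow/shine
    gl_FragData[3] = emissive;                                  // RGBA8 emissive, real blending
}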

(As a side note, if we had a different layout, we could blend the shininess ratio, for example, when we keep a physical position fragment, to try to limit shininess on translucent elements.)

Note that on OS X 10.5 red-green textures are not available, so we have to fall back to four RGBA_16F textures, doubling VRAM. This costs us at least 20% fill-rate on an 8800, but the quick sanity test I did wasn't very abusive; heavier cases are probably a bit worse.

So far we seem to be surviving with a 16-bit floating-point eye-space depth coordinate. It does not hold up very well in a planet-wide render, though, for rendering techniques where reconstructing the position is important (e.g. O'Neil-style atmospheric scattering). A simple workaround would be to calculate the intersection of the fragment ray with the planet directly, by transforming the planet sphere into model-view space. (That is, if we know that our fragment came from a sphere, why not just work with the original mathematical sphere?)
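That workaround might look something like this - a minimal sketch assuming the planet center and radius arrive as eye-space uniforms; the names are illustrative:
uniform vec3  planet_center_eye;   // hypothetical: planet center transformed into eye space
uniform float planet_radius;       // hypothetical: planet radius in eye-space units

// Intersect the view ray (from the eye, through the fragment) with the planet
// sphere; returns the eye-space hit point, or the far point if the ray misses.
vec3 ray_planet_hit(vec3 ray_dir_eye, float far_dist)
{
    vec3  dir  = normalize(ray_dir_eye);
    vec3  oc   = -planet_center_eye;                // the eye sits at the origin in eye space
    float b    = dot(oc, dir);
    float c    = dot(oc, oc) - planet_radius * planet_radius;
    float disc = b * b - c;
    if (disc < 0.0)
        return dir * far_dist;                      // missed the planet entirely
    float t = -b - sqrt(disc);                      // nearer of the two intersections
    return dir * max(t, 0.0);
}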

16F depth does give us reasonably accurate shadows, at least up close, and far away the shadows are going to be limited in quality anyway. I tried logarithmic depth, but it made shadows worse.

Packing Shadowing and Shine

For the static shadow/AO ("shadow") term and the level of specularity ("shine") we have two parameters that need about 8 bits of precision, and we have a 16-bit channel. Perfect, right? Well, not so much.
  • NVidia cards won't render to 16-bit integer channels on some platforms.
  • ATI cards don't expose a bit-cast from integer to floating point, making it hard to pack a real 16-bit int (or two 8-bit ints) into a float.
  • If we change the channel to RGBA8 (and combine RG into a virtual 16F) we can't actually use the fourth byte (alpha) because the GL insists on dumping our alpha values. Extended blend would fix this but it's not supported on OS X and even on Windows you can't use it with multiple render targets.
So we can't actually get the bits we pay for, and that sucks. But we can cheat. The trick is this: the deferred renderer will cut out specular highlights that are in shadow. Thus as the shadow term becomes stronger, the shine term becomes unimportant.

So we simply encode 256.0 * shadow + shine. The resulting quantity gives shininess over 8 bits of precision without shadow, and reduces shininess to around 2 bits in full shadow. If you view the decoded channels separately you can see banding artifacts come in on the shine channel as the shadows kick in. But when you view the two together in the final shader, the artifacts aren't visible at all because the shadow masks out the banded shininess.

(What this trick has done is effectively recycle the exponent as a 'mixer', allocating bits between the two channels.)
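In GLSL the round trip looks something like this - a sketch, since only the 256.0 * shadow + shine encode is the real recipe; the rounding details and the decode here are assumptions:
// Pack shadow into the integer part and shine into the fraction. (The exact
// rounding is illustrative; the key idea is just 256.0 * shadow + shine.)
float pack_shadow_shine(float shadow, float shine)
{
    return floor(shadow * 256.0 + 0.5) + clamp(shine, 0.0, 255.0 / 256.0);
}

// Unpack: as the integer part grows, fewer mantissa bits of the 16F channel
// remain for the fraction, so shine loses precision exactly when shadow is
// strong enough to mask it anyway.
void unpack_shadow_shine(float packed_value, out float shadow, out float shine)
{
    shadow = floor(packed_value) / 256.0;
    shine  = fract(packed_value);
}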

Future Formats

A future extension would be to add a fifth 4-byte target. This would give us room to extend to full 32-bit floating-point depth (should that prove to be useful), with shadow and shine in 8-bit channels and one new 8-bit channel left over. Or alternatively we could:
  • Keep Z at 16F and keep an extra channel.
  • Technically, if we can accept this 'lossy' packing, we can get four components into an RG16F, whereas we can only get three pure 8-bit components. (This is due to how the GL manages alpha.)
  • If we target hardware that provides float-to-int bit casts, we could have four 8-bit components in the new channel.

Saturday, December 04, 2010

Semaphore Follow-Up: NPTL

A quick follow-up to my previous post on condition variables, etc. With NPTL (the pthreads implementation on Linux), a lot of the original issues I was trying to cope with don't exist. Some things NPTL does:
  • pthread mutexes are spin-sleep locks, so they can be used to protect short-term critical sections without too much trouble. Given a moderately contested but briefly held lock, this is a win.
  • sem_t semaphores have an atomic counter to avoid system calls in the uncontested case. When inited privately (sem_init) they appear to be lean and mean.
  • All synchronization is done around futexes, ensuring that uncontested cases can be managed with atomic operations. (The OS X pthreads library at least uses spin locks around user-space book-keeping for the uncontested case, but I think the futex code path is faster.)
There is one case where using a condition variable really would be superior to a semaphore on Linux: if you really want a condition variable (and aren't just using it to build a semaphore). In particular, the futex system call that lets NPTL sleep threads as needed has special operations to move a thread from one queue to another while asleep. This fixes the thundering-herd problem where every thread waiting on a condition is woken up at once. This isn't something I use, but if you need it, NPTL makes it fast.

Friday, December 03, 2010

Performance of Semaphore Vs. Condition Variable

I was looking at the performance of X-Plane's inter-thread messaging. Inter-thread message costs haven't been a huge concern to us in the past because the jobs sent to worker threads tend to be quite large, amortizing overhead, and they are usually returned asynchronously, so we don't care about latency.

X-Plane 10 features a threaded flight model (or rather, the flight model of all airplanes is executed in parallel). This is a case where we do care about latency (since we can't proceed until the flight model completes*) and the job size is not that big, so overhead matters more.

Our old message queue was based on a pthread condition variable. It was taking about 200 µsec to launch all workers, and a Shark analysis showed what I can only describe as a "total train wreck" of lock contention. The workers were waking up and immediately crashing into each other trying to acquire the mutex that protects the condition variable's "guts".

I replaced the condition variable + mutex implementation of the message queue with the implementation that already ships on Windows: a critical section to protect contents plus a semaphore to maintain count. I implemented the semaphore as an atomic wrapper around a mach semaphore. (See the libdispatch semaphore code for an idea of how to do this.) The results were about 2x better: 80 µsec to launch from sleep, about 45 µsec of which were spent signaling the mach semaphore (5-10 µsec per call) plus a little bit of overhead.

So why was the new implementation faster? I dug into the pthreads source to find out.

What Happens With PThreads

The pthread implementation on OS X (at least as of 10.5.x whose source I browsed, it's in libc btw) goes something like this:
  • A spin lock protects the "guts" of both a pthread mutex and a pthread condition variable.
  • An OS semaphore provides the actual blocking for a pthread mutex and a pthread condition variable.
Mutex lock/unlock is therefore not as good as a real spinlock (for the cases where spinlocks are, um, good) but it does have a fast path where, if the lock is uncontested, you can get in, take the lock and get out without ever having to make a system call.

The condition variable sequence is a little bit trickier. A signal to a condition variable with no waiters is a fast early exit, but we have to grab the spin lock twice if we signal someone. The condition wait is the painful part: pthread_cond_wait has to reacquire the locking mutex once the waiter is signaled. This is the source of the train wreck; if a lot of messages are written to a lot of worker threads at once, the worker threads bottleneck trying to reacquire the mutex that is tied to the condition variable that woke them up.

Better With Semaphores

When using the atomic semaphore/critical section design, the advantage we get is significantly finer-grained locking. The lock on our message queue now only protects the queue items themselves, and is never held while we are inside pthread code. This gets us through significantly faster. It is inevitable that if we are going to queue work to many threads via one queue, there's going to be some bottleneck trying to get the messages out. This design minimizes the cost of getting a message out and thus minimizes the serialization path.

(Shark doesn't have quite the resolution to measure it in the code example I have running now, but I suspect that some of the cost still left in the system is the spin time waiting for messages to be pulled.)

Part of the win here is that we are using a spin lock rather than a sleep lock on the message queue; while the pthread mutex will fast-case when uncontested, it will always sleep the thread if even two threads are trying to acquire the mutex. If the protected piece of code is only a few instructions, a spin is a lot cheaper than the system call to block on a semaphore.

Better Throughput When Pre-Queued

One of the things that made the old design go haywire in the version 10 test case was that it was a two-level set of message queues, with the main job list containing a message to go check another queue for the actual work. (Believe it or not there is a good design reason for this, but that's another post.) In this two-level design the semaphore + critical section hits its fast case. The second queue is already full by the time we start waking up workers. Therefore the second queue's semaphore is positive, which means that the decrement in the message dequeue will hit the fast atomic-operations case and be non-blocking. The workers then simply have to grab the spin lock, pop the message, and be done.

(Again, the pthread implementation can fast-case when the mutex is uncontested, but it doesn't spin.)

To forestall the comments: this is not a criticism of the performance of pthread condition variables and mutexes; performance could certainly also have been boosted by using a different set of pthread synchronizers. It is mainly an observation that, because the pthread mutex is heavier than a spin lock, condition variables may be expensive when you wanted a light-weight lock but didn't get one.

Future Performance Boosts

There is one additional performance boost I have been looking at, although I am not sure when I will put it into place. The message queue (which is a lock around an STL list right now**) could be replaced with a ring buffer FIFO. With the FIFO design we would maintain:
  • A pair of semaphores, counting filled and free entries. Reading/writing blocks until the needed filled/free entry is available.
  • Atomic read/write pointers that can be advanced once the semaphore indicates that there is content/free space to advance to.
This design would have the advantage that it doesn't need to spin, with message read/write through the ring buffer being almost fully independent. Furthermore, since we have atomics wrapping our semaphores, the fast case (reading when there are messages, writing when there is free space) is non-blocking and non-spinning.

Two things stop me from dropping this in:
  • Our current message queue API assumes that messages can always be written and that writing is reasonably fast. (The current write only needs to acquire a spin lock.) If the ring buffer can fill, we could block indefinitely until space is available, or we have to expose to client code that writes can time out. That's a pretty big change.
  • The current queue has a lock-and-edit operation that is used to optimize the case where queued jobs have to be flushed. Since the ring buffer FIFO is lock free, we can't really lock it to inspect/edit it.
* We looked at implementing a truly asynchronous flight model, but the interaction between third-party add-ons, the flight model, and the rendering engine was such that the synchronization issues were too large for a relatively small win (the reduction of latency of one flight model). Running the individual flight models in parallel against each other gets us most of the win in the cases where the flight model really costs us something.