- Flow control in the shader. This is good for an application that has poor OpenGL state sorting, because you can now draw everything with one shader and let conditionals sort it out. But ... conditionals are not very fast (and in some hardware, not very existent), which leads to the other option...
- Create smaller specialized shaders by pre-processing. This is essentially a loop-unrolling of option 1 - if an app has good state sorting, it makes sense to produce a specialized small shader for each case.
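A minimal sketch of the second option: specialize one big GLSL source by prepending #defines, so the preprocessor strips out unused features at compile time. The feature names and flags here are hypothetical, not X-Plane's actual set.

```cpp
#include <string>

// Hypothetical feature flags - the real set would be much larger.
enum ShaderFeature { feat_fog = 1, feat_cloud_shadows = 2, feat_two_tex = 4 };

// Build a specialized shader source by prepending #defines.  Inside the
// GLSL source, each optional feature is guarded by #if HAS_FOG,
// #if HAS_CLOUD_SHADOWS, etc., so the GLSL preprocessor does the unrolling.
std::string specialize_shader(const std::string& ubershader_glsl, int features)
{
    std::string src = "#version 120\n";
    src += std::string("#define HAS_FOG ")           + ((features & feat_fog)           ? "1" : "0") + "\n";
    src += std::string("#define HAS_CLOUD_SHADOWS ") + ((features & feat_cloud_shadows) ? "1" : "0") + "\n";
    src += std::string("#define HAS_TWO_TEX ")       + ((features & feat_two_tex)       ? "1" : "0") + "\n";
    src += ubershader_glsl;
    return src;
}
```

The resulting string would then be handed to glShaderSource as usual; each distinct feature combination yields a distinct, smaller compiled shader.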
In our case, the üshader accomplishes a number of nice things:
- We wrap it in an abstraction layer, so that all graphics hardware looks "the same" to X-Plane - we even have a fixed function implementation of the üshader interface. (This means: we have to live with whatever simplifications the fixed-function implementation makes - it is very hard for client code to know how its drawing request is "dumbed down". But we can live with this.)
- The abstraction layer contains the OpenGL setup code for the shader, which keeps the huge pile of uniform setups and binds in one place. One problem with GLSL development is that you end up with two separate chunks of code (the GL setup and the GLSL shader) that can easily fall out of sync. The üshader keeps the GLSL in one place, and the abstraction wrapper keeps the GL setup in one place.
- The abstraction layer presents a very orthogonal view of features, which helps us build a rendering engine that "does the right thing", e.g. if basic terrain can have cloud shadows, then two-texture terrain, runway terrain, road surface terrain, and every other type can also have cloud shadows. Given the ability to plug user-authored content into the sim, we can't predict what combinations we might need to support.
- The abstraction layer can build the shaders lazily, which keeps the number of shaders we keep around down to sane limits. This is slightly frightening in that we don't (normally) generate every possible shader - conceivably there could be GLSL compile errors in certain odd combinations.
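The lazy-building point above can be sketched as a cache keyed on the requested feature bits - a shader is compiled the first time a combination is actually drawn, and reused after that. The structure and names here are illustrative, not X-Plane's real code; the integer handle stands in for a GL program object.

```cpp
#include <map>

// Lazily built shader cache: feature bits -> program handle.
// The int handle is a stand-in for a real GL program object.
struct ShaderCache {
    std::map<int, int> built;
    int next_handle = 1;
    int compiles = 0;       // how many times we actually compiled

    int get_program(int features)
    {
        auto it = built.find(features);
        if (it != built.end())
            return it->second;      // already built - cheap lookup

        ++compiles;                 // first use: compile + link here
        int handle = next_handle++;
        built[features] = handle;
        return handle;
    }
};
```

The down-side mentioned above falls directly out of this design: if a rare feature combination has a GLSL error, we won't find out until a user actually hits that combination at runtime.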
We must consider the question: does this technique produce good GLSL shaders? Well, it clearly produces better shaders than simply building a non-specialized üshader, which would contain a lot of extra crud. The big concern for a üshader would be that the different parts of the shader, when glued together, would not produce optimal code.
So far I have found this to be not a huge problem. GLSL compilers have come a long way - it appears to me that the days of having to hand-roll GLSL to optimize are over. A year or two ago, I would be combining expressions to improve code performance; now I use temporaries and let the compiler figure it out. That's enough compiler smarts to glue shader parts together well, as long as the parts really are all necessary.
There are some cases the compiler can't optimize, and they cause the conditionalized code to become overly complex. For example, if my vertex shader outputs a varying and the fragment shader down the line doesn't use it, that mismatch is only discovered at link time, so I don't expect to get any dead-code elimination on the code that feeds the varying.
There is one way to produce substantially better shader code, and that is to recognize when parts of the shader features are not necessary. For example, when we are doing a depth-only shadow render, we don't need to do any texture reads if the texture doesn't have an alpha channel.
One way to achieve this is to use a "feature optimizer" - that is, a specific piece of code that examines the naive requested shader state, and converts it to the true minimum required state (e.g. turning off texturing for non-alpha textures during depth-only shading). Given clients of the shader code all over the app, having one optimizer as part of the shader interface means consistent application of optimization all the time.
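A feature optimizer of this kind can be a single pure function from requested state to minimum state, which every client goes through. This is a sketch under assumed names - the struct and the depth-only/alpha rule match the example in the text, but the real state vector would be much bigger.

```cpp
// Hypothetical slice of the requested shader state.
struct ShaderState {
    bool depth_only;          // shadow-style render: only depth is written
    bool texture_bound;       // client asked for texturing
    bool texture_has_alpha;   // texture has an alpha channel
};

// Convert the naive requested state into the true minimum required state.
// In a depth-only pass, the texture matters only for alpha testing, so a
// texture with no alpha channel can be dropped entirely.
ShaderState optimize_state(ShaderState s)
{
    if (s.depth_only && s.texture_bound && !s.texture_has_alpha)
        s.texture_bound = false;
    return s;
}
```

Because every drawing client funnels through this one function, the optimization is applied consistently everywhere, rather than being re-derived (or forgotten) at each call site.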
Perhaps more important than GLSL optimization is GL state optimization - making sure that we change as little as possible between batches for fastest processing. On the fastest machines, users need to increase "object count" (replicated objects whose meshes are resident in VRAM) to really utilize their GPU's power. At that point, state change becomes the critical path, even on a very fast machine.
Unfortunately, it can be difficult to tell what set of state transitions the renderer will have to make - it depends a lot on culling, as well as user-generated content.
One way to improve state change is to keep local caches of commonly changed state to avoid calls into the driver.
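The local-cache idea looks like this in miniature: shadow the last value we sent to the driver and skip the call when nothing changed. The counter here is only for illustration; in real code the body of bind() would be the actual glBindTexture call.

```cpp
// Local shadow of one commonly changed piece of GL state.  A real engine
// would have one of these per texture unit, plus caches for the bound
// program, blend mode, etc.
struct TextureBindCache {
    int bound = -1;        // -1 = unknown; forces the first real bind
    int driver_calls = 0;  // stand-in counter for calls into the driver

    void bind(int tex)
    {
        if (tex == bound)
            return;        // redundant - skip the driver entirely
        ++driver_calls;    // here is where glBindTexture(...) would go
        bound = tex;
    }
};
```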
A more aggressive optimization is to provide "specific transition" functions for the shader. So in addition to an API that says "set up the shader with these settings", we have APIs that change specific small aspects of the shader - ones that are commonly changed and easy to recognize - ideally via an optimized path. For example: often in an airplane we change just one texture (but no other state), and we know we are only changing the texture. We can avoid a lot of thrash by recognizing this and using an optimized code path.
Clash of the Optimizers
Here's where things get ugly: can we still recognize a fast transition situation when there is an optimizer trying to alter GL state? Naively, the answer is no. For example, consider the case where we eliminate texture reads for no-alpha textures. If we now want to do a fast "texture only" change, we can only do it optimally if both textures agree on whether they have an alpha channel - and in fact, the optimal case is to do absolutely nothing if neither texture has alpha.
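The resolution can be sketched as a small decision function: the fast path asks the same question the feature optimizer asks, so the two stay consistent. The names and the three-way return value are assumptions for illustration, not X-Plane's actual interface.

```cpp
// What to do for a "texture only" change when the feature optimizer may
// have stripped the texture read during a depth-only render.
enum TransitionAction {
    do_nothing,      // optimizer removed the read for both textures
    fast_swap,       // texture read stays on - just rebind the texture
    full_resetup     // texturing toggles on/off - rebuild shader state
};

TransitionAction texture_change_action(bool depth_only,
                                       bool old_has_alpha,
                                       bool new_has_alpha)
{
    if (depth_only) {
        if (!old_has_alpha && !new_has_alpha)
            return do_nothing;     // neither texture is read - skip everything
        if (old_has_alpha && new_has_alpha)
            return fast_swap;      // read stays enabled - cheap rebind
        return full_resetup;       // texturing turned on or off under us
    }
    return fast_swap;              // color pass always samples the texture
}
```

In other words, the fast-transition path can coexist with the optimizer, but only if it replays the optimizer's logic for both the old and new state and compares the results.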