tag:blogger.com,1999:blog-60424177755781071062024-03-05T15:42:02.256-05:00The Hacks of LifeChrishttp://www.blogger.com/profile/14648675681957285299noreply@blogger.comBlogger340125tag:blogger.com,1999:blog-6042417775578107106.post-1027658966292873912022-06-11T11:38:00.002-04:002022-06-11T11:38:21.133-04:00sRGB, Pre-Multiplied Alpha, and Compression<p> How do sRGB color-space textures (especially compressed ones) interact with blending? First, let's define some terms:</p><p>An image is "sRGB encoded" if the linear colors have been remapped using the sRGB encoding. This encoding is meant for 8-bit storage of textures with normalized colors from 0.0 to 1.0. The "win" of sRGB encoding is that it uses more of our precious 8 bits on the dark side of the 0..1 distribution, and humans have weird non-linear vision where we see more detail in that range.</p><p>Thus sRGB "spends our bits where they are needed" and limits banding artifacts. If we just viewed an 8-bit color ramp from 0 to 1 linearly, we'd see stripes in the darks and perceive a smooth color wash in the highlights.</p><p>Alpha values are always considered linear, i.e. 50% is 50% and is stored as 127 (or maybe 128, consult your IHV?) in an 8-bit texture. But how that alpha is applied depends on the color space where blending is performed.</p><p></p><ul style="text-align: left;"><li>If we decode all of our sRGB textures to linear values (stored in more than 8 bits) and then blend in this linear space, we have "linear blending" (sometimes called "gamma-correct" blending - the terminology is confusing no matter how you slice it). With linear blending, translucent surfaces blend the way light blends - a mix of red and green will make a nice bright yellow. Most game engines now work this way - it's pretty much mandatory for HDR (because we're not in a 0..1 range, we can't be in sRGB) and it makes lighting effects look good.</li><li>If we stay in the 8-bit sRGB color space and just blend there, we get "sRGB blending". Between red and green we'll see a sort of dark mustard color and perceive a loss of light energy. Lighting effects blended in sRGB look terrible, but sometimes artists want sRGB blending. The two reasons I've heard from my art team are (1) "that's what Photoshop does" and (2) they are trying to simulate <i>partial coverage </i>(e.g. this surface has rusted paint and so half of the visible pixel area is not showing the paint) and blending in a perceptual space makes the coverage feel right.</li></ul><div>A texture has pre-multiplied alpha if the colors in the texture have already been multiplied (in fractional space) by the alpha channel. In a texture like this, a 50% translucent area has RGB colors that are 50% darker. There are two wins to this:</div><div><ul style="text-align: left;"><li>You can save computing power - not exciting for today's GPUs, but back when NeXT first demonstrated workstations with real-time alpha compositing on the CPU (in integer, of course), premultiplication was critical for cutting ALU by removing half the multiplies from the blending equation.</li><li>Premultiplied alpha can be filtered (e.g. two samples can be blended together) without any artifacts. The black under clear pixels (because they are multiplied with 0) is the correct thing to blend into a neighboring opaque pixel to make it "more transparent" - the math Just Works™.
With non-premultiplied textures, the color behind clear pixels appears in intermediate samples - tool chains must make sure to "stuff" those clear texels with nearby colors.</li></ul><h3 style="text-align: left;">Pre-Multiplication and sRGB</h3></div><div>Can we pre-multiply sRGB encoded textures that are intended to be decoded and used in a linear renderer? Yes! But here's the catch: we can only do this right if we know the type of blending we will do.</div><div><br /></div><div>Pre-multiplying is basically doing a little bit of the blending work ahead of time. Therefore it's actually not surprising that we need to know the exact math of the blend to do this work in advance. There are two cases:</div><div><br /></div><div>If we are going to blend in sRGB (i.e. nothing is ever decoded) we can just pre-multiply our 8-bit colors with 8-bit alpha and go home happy. This is what we did in 2005 and we liked it, because it's all there was.</div><div><br /></div><div>If we are going to decode our sRGB to linear space, then blend, we have to do something more complex: when pre-multiplying our texture we need to:</div><div><ol style="text-align: left;"><li>Decode our colors to linear (and use more than eight bits to do the following intermediate calculations).</li><li>Multiply the alpha value by the linear color.</li><li>Re-encode the resulting darkened color back to sRGB.</li></ol><div>We have now baked in the multiply that would have happened in normal linear blending. (A small code sketch of this bake appears below.) This should be okay in terms of precision - while we are likely to have darker results post-blending, sRGB is already using bits for those darker areas, and they're going to have less weight on screen.</div></div><div><br /></div><div>The fine print here is that the "bake" we did to convert from non-premultiplied to pre-multiplied has different math based on the blending color space and that decision is permanent. While a non-premultiplied texture can be used for either blending, once you pre-multiply, you're committed.</div><h3 style="text-align: left;">Does Compression Change Anything?</h3><div>What if our sRGB textures are going to be compressed? In theory, no - S3TC compression states that the block is decompressed into sRGB color space first, then "other stuff happens". And this is useful - with only 16 bits for block color end-points, S3TC blocks need to spread their bits as evenly as possible from a perceptual perspective.</div><div><br /></div><div>In practice, I'd be very wary. DXT3/5 provide two separate compression blocks, one for color, and one for alpha. But with premultiplied alpha, the alpha mask has been "baked" into the color channel and is going to put pressure on end-point selection.</div><div><br /></div><div>Consider the case of a 4x4 tile that goes from clear to opaque as we go left to right, and green to blue as we go from bottom to top.</div><div><br /></div><div>Without pre-multiplication, we pick green and blue as color end points, clear and opaque as alpha end points, and we can get very accurate compression.</div><div><br /></div><div>With pre-multiplication we're screwed.</div>
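<div><br /></div><div>(A quick aside: the linear-space "bake" from the previous section is small enough to sketch in code. This is a rough illustration only - the helper names and the use of the standard sRGB transfer functions are my own assumptions, not code from any particular tool chain:)</div><div><pre>// Pre-multiply one 8-bit sRGB-encoded channel for *linear* blending.
// All intermediate math is done in float, not in 8 bits.
#include <cmath>
#include <cstdint>

static float srgb_to_linear(float s) {
    return (s <= 0.04045f) ? s / 12.92f : std::pow((s + 0.055f) / 1.055f, 2.4f);
}

static float linear_to_srgb(float l) {
    return (l <= 0.0031308f) ? l * 12.92f : 1.055f * std::pow(l, 1.0f / 2.4f) - 0.055f;
}

// c is an sRGB-encoded color channel (0..255); a is linear alpha (0..255).
static std::uint8_t premultiply_srgb_for_linear_blending(std::uint8_t c, std::uint8_t a) {
    float linear   = srgb_to_linear(c / 255.0f);   // 1. decode to linear
    float darkened = linear * (a / 255.0f);        // 2. multiply by alpha
    float encoded  = linear_to_srgb(darkened);     // 3. re-encode back to sRGB
    return static_cast<std::uint8_t>(encoded * 255.0f + 0.5f);
}</pre></div><div>If instead we plan to blend in sRGB, we skip the decode/re-encode and just multiply the 8-bit values directly - which is exactly why the two bakes aren't interchangeable.</div>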
<div>Back to the 4x4 tile: we either pick black as an end point (so we can have black in the color under the clear pixels) and our color end point has to be a single green/blue mash-up, destroying any color change, or we pick green and blue as our colors and we have "light leak" when our colors aren't dark under the alpha.</div><div><br /></div><div>For this reason, I think pre-multiplying isn't appropriate for block compressors even if we get linear blending correct.</div><p></p>Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-43189011705330595532021-10-29T16:52:00.000-04:002021-10-29T16:52:41.648-04:00This One Weird Trick Lets You Beat the Experts<div dir="ltr" style="text-align: left;" trbidi="on">As I've mentioned in the past, one of my goals with this blog is to launch my post-development career in tech punditry and eventually author an O'Reilly book with an animal on the cover.* So today I'm going to add click-baity titles to my previous pithy quotes and tell you about <b>that one weird trick that will let you beat the experts and write faster code</b> than the very best libraries and system implementations.</div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on">But first, a Vox-style explanatory sub-heading.</div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><h3 style="text-align: left;">StackOverflow is a Website That Destroys New Programmers' Hopes and Dreams</h3><div dir="ltr" style="text-align: left;" trbidi="on">Here is a typical question you might see on Stack Overflow**:</div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on"></div><blockquote><div dir="ltr" style="text-align: left;" trbidi="on">I'm new to programming, but I have written a ten line program. How can I write my own custom allocator for my new game engine? I need it to be really fast and also thread safe.</div><div dir="ltr" style="text-align: left;" trbidi="on"></div></blockquote><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on">StackOverflow is a community full of helpful veteran programmers who are there for the newbies among us, so someone writes an answer like this.***</div><blockquote>Stop. Just stop. Unplug your keyboard and throw it into the nearest dumpster - it will help your game engine to do so. The very fact that you asked this question shows you should not be trying to do what you're doing.<br /><br />You cannot beat the performance of the system allocator. It was written by a Linux programmer with an extraordinarily long beard. That programmer spent over 130 years studying allocation patterns. To write the allocator, he took a ten year vow of silence and wrote the allocator in a Tibetan monastery using the sand on the floor as his IDE. The resulting allocator is hand-optimized for x86, ARM, PDP-11 macro-assembler, and the Zilog Z-80. It uses non-deterministic finite state automata, phase-induced reality distortion fields, atomic linked lists made from enriched Thorium, and Heisenberg's uncertainty principle. It is capable of performing almost 35 allocations per second.
You are not going to do better.<br /></blockquote><div dir="ltr" trbidi="on">Discouraged, but significantly wiser, our young programmer realizes that his true calling in life is to simply glue together code other people have written, and vows to only develop electron apps from this point forward.</div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on">But what if I told you there was this one weird trick that would allow our newbie programmer to beat the system allocator's performance?</div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><h3 style="text-align: left;">The Secret of My Success</h3><div dir="ltr" style="text-align: left;" trbidi="on">Come closer and I will whisper to you the one word that has changed my life. This one thing will change how you code forever. Here's what our young programmer needs to do to beat the system allocator:</div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on"><i><b>Cheat.</b></i></div><div dir="ltr" style="text-align: left;" trbidi="on"><i><b><br /></b></i></div><div dir="ltr" style="text-align: left;" trbidi="on">Here, I'll show you how it's done.</div><div dir="ltr" style="text-align: left;" trbidi="on"><pre>static char s_buffer[1024];
void * cheat_malloc(size_t bytes) { return s_buffer; }
void cheat_free(void * block) { }</pre></div><div dir="ltr" style="text-align: left;" trbidi="on">(Listing 1 - the world's worst allocator.)</div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on">One thing you cannot deny: this code is <i>a lot faster than malloc.</i></div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on">Now you might be saying: "Ben, that is literally the worst allocator I have ever seen" (to which I say "Hold my beer") but you have to admit: it's not that bad if you don't <i>need</i> all of the other stuff malloc does (like getting blocks larger than 1K or being able to call it more than once or actually freeing memory). And really fast. Did I mention it was fast?</div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on">And here we get to the pithy quotes in this otherwise very, very silly post:</div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on"></div><blockquote><div dir="ltr" style="text-align: left;" trbidi="on"><i>You might not need the full generality and feature set of a standard solution.</i></div><div dir="ltr" style="text-align: left;" trbidi="on"></div></blockquote><div dir="ltr" style="text-align: left;" trbidi="on">and</div><div dir="ltr" style="text-align: left;" trbidi="on"></div><blockquote><div dir="ltr" style="text-align: left;" trbidi="on"><i>You can write a faster implementation than the experts if you don't solve the fully general problem.</i></div></blockquote><div dir="ltr" style="text-align: left;" trbidi="on">In other words, the experts are playing a hard game - you win by cheating and playing tic tac toe while they're playing chess.</div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><h3 style="text-align: left;">All That Stuff You Don't Need</h3><div dir="ltr" style="text-align: left;" trbidi="on">A standard heap allocator like malloc does <i>so much stuff</i>. Just <i>look</i> at this requirements list.</div><div dir="ltr" style="text-align: left;" trbidi="on"><ul style="text-align: left;"><li>You can allocate any size block you want. Big blocks! Small blocks! Larger than a VM page. Smaller than a cache line. Odd sized blocks, why not.</li><li>You can allocate and deallocate your blocks in any order you want. Tie your allocation pattern to a random number generator, it's fine, malloc's got your back.</li><li>You can free and allocate blocks from multiple threads concurrently, and you can release blocks on different threads than you allocated them from. Totally general concurrency, there are no bad uses of this API.</li></ul><div>The reason I'm picking on these requirements is because they make the implementation of malloc complicated and slow.****</div><div><br /></div><div>One of the most common ways to beat malloc that every game programmer knows is a bump allocator. (I've heard this called a Swedish allocator, and it probably has other names too.) The gag is pretty simple:</div><div><ol style="text-align: left;"><li>You get a big block of memory and keep it around forever. 
You start with a pointer to the beginning.</li><li>When someone wants memory, you move the pointer forward by the amount of memory they want and return the old pointer.</li><li>There is no free operation.</li><li>At the end of a frame, you reset the pointer to the beginning and recycle the entire block.</li><li>If you want this on multiple threads, you make one of these <i>per thread.</i></li></ol><div>This is fast! Allocation is a single add, freeing is zero operations, and there are no locks. In fact, cache locality isn't bad either - successive allocations will be extremely close together in memory.</div></div><div><br /></div><div>Our bump allocator is only slightly more complex than the world's worst allocator, but it has a lot in common with it: it's faster because it doesn't provide all of those general purpose features that the heap allocator implements. If the bump allocator is too special purpose, you're out of luck, but if it fits your design needs, it's a win.</div><div><br /></div><div>And it's simple enough you can write it yourself. (A rough sketch appears at the end of this post, after the footnotes.)</div></div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><h3 style="text-align: left;">General Implementations are General</h3><div dir="ltr" style="text-align: left;" trbidi="on">You see the same thing with networking. The conventional wisdom is: "use TCP, don't reinvent TCP", but when you look into specific domains where performance matters (e.g. games, media streaming) you find protocols that do roll their own, specifically because TCP comes with a bunch of overhead to provide its abstraction ("the wire never drops data") that is expensive and not needed for the problem space.</div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on">So let me close with <i>when</i> to cheat and roll your own. It makes sense to write your own implementation of something when:</div><div dir="ltr" style="text-align: left;" trbidi="on"><ul style="text-align: left;"><li>You need better performance than a standard implementation will get you and</li><li>Your abstraction requirements are simpler or more peculiar than the fully general case and</li><li>You can use those different/limited requirements to write a faster implementation.</li></ul><div>You might need a faster implementation of the fully general case - that's a different blog post, one for which you'll need a very long beard.</div></div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on">* The animal will be some kind of monkey flinging poop, obviously.</div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on">** Not an actual StackOverflow question.</div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on">*** Not an actual StackOverflow answer. You can tell it's not realistic because the first answer to any SO question is always a suggestion to use a different language.</div><div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on">**** I mean, not that slow - today's allocators are pretty good - and certainly better than I can write given those requirements.
But compared to the kinds of allocators that <i>don't solve those problems</i>, malloc is slower.</div>
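<div dir="ltr" style="text-align: left;" trbidi="on"><br /></div><div dir="ltr" style="text-align: left;" trbidi="on">Appendix: the bump allocator sketch promised above. This is a rough illustration of the idea, not code from any shipping engine - the class and member names are mine, and "error handling" is just returning null when the block runs out.</div><div dir="ltr" style="text-align: left;" trbidi="on"><pre>#include <cstddef>
#include <cstdint>

// One of these per thread; reset() once per frame.
class bump_allocator {
public:
    explicit bump_allocator(std::size_t capacity)
    : m_begin(new std::uint8_t[capacity]), m_next(m_begin), m_end(m_begin + capacity) { }
    ~bump_allocator() { delete[] m_begin; }

    // Allocation is a pointer bump plus alignment - no locks, no free list.
    void * alloc(std::size_t bytes, std::size_t align = alignof(std::max_align_t))
    {
        std::uintptr_t p = reinterpret_cast<std::uintptr_t>(m_next);
        p = (p + align - 1) & ~(std::uintptr_t(align) - 1);  // round up to alignment
        std::uint8_t * block = reinterpret_cast<std::uint8_t *>(p);
        if (block + bytes > m_end)
            return nullptr;                                   // out of space this frame
        m_next = block + bytes;
        return block;
    }

    // There is no per-block free - the whole block is recycled at end of frame.
    void reset() { m_next = m_begin; }

private:
    std::uint8_t * m_begin;
    std::uint8_t * m_next;
    std::uint8_t * m_end;
};</pre></div>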
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-91894681421798287902021-10-16T16:51:00.002-04:002021-10-16T16:51:15.572-04:00Is It Ever Okay to Future Proof?<div dir="ltr" style="text-align: left;" trbidi="on">
A while ago I tried to make a case for solving domain-specific problems, <a href="http://hacksoflife.blogspot.com/2018/08/solve-less-general-problems.html">rather than general ones</a>. My basic view is that engineering is a limited resource, so providing an overly general solution takes away from making the useful specific one more awesome (or quicker/cheaper to develop).<br />
<div>
<br /></div>
<div>
Some people have brought up <a href="https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it">YAGNI</a>, with opinions varying from "yeah, YAGNI is spot on, stop building things that will never get used" to "YAGNI is being abused as an excuse to just duct tape old code without ever paying off technical debt."</div>
<div>
<br /></div>
<div>
YAGNI is sometimes orthogonal to generalization. "Solving specific problems" can mean solving two smaller problems separately rather than trying to come up with one fused super-solution. In this case, there's no speculative engineering or future proofing, it's just a question of whether the sum can be less than the parts. (I think the answer is: Sometimes! It's worth examining as part of the design process.)<br />
<br />
But sometimes YAGNI is a factor, e.g. the impulse to solve a general problem comes from an assumption that this general design will cover future needs too. Is that ever a good thing to assume?</div>
<div>
<br /></div>
<h3 style="text-align: left;">
Nostradamus the Project Manager</h3>
<div>
So: is it ever okay to future proof? Is it ever okay to have design requirements for a current engineering problem to help out with the next engineering problem? Here are three tests to apply - if <b>any</b> of them fail, don't future proof.<br />
<br />
<ul style="text-align: left;">
<li>Do you know for a fact that the future feature is actually needed by the business? If not, put your head down and code what needs to be done today.</li>
<li>Do you know exactly how you would solve the future feature efficiently with today's code? If not, back away slowly.</li>
<li>Would the total engineering cost of both features be, in sum, smaller if you do some future proofing now? If not, there's no win here.</li>
</ul>
Bear in mind, even if you pass all of these three tests, future proofing might <i>still</i> be a dumb idea. If you are going to grow the scope of the feature at all by future proofing, you had best check with your team about whether this makes sense. Maybe time is critical or there's a hard ship date. Keep it tight and don't slip schedule.<br />
<br />
I find the three tests useful to cut off branches of exploration that aren't going to be useful. If my coworker is worrying that the design is too specific and can't stop playing "what if", these tests are good, because they help clarify how much we know about the future. If the answer is "not a ton", then waiting is the right choice.<br />
<br />
And I think it's worth emphasizing <b>why</b> this works. When solving new coding problems, one of the ways we learn about the problem space and its requirements is <i>by writing code</i>. (We can do other things like create prototypes and simulations, test data, etc. but writing code absolutely is a valid way to understand a problem.)<br />
<br />
The act of writing feature A doesn't just ship feature A - it's the R&D process by which you learn enough to write feature B. So fusing A & B's design may not be possible because you have to code A to learn about B. These questions highlight when features have this kind of "learning" data dependency.<br />
<br /><h3 style="text-align: left;">
Sometimes Ya Are Gonna Need It</h3>
With that in mind, we <i>do</i> sometimes future proof code in X-Plane. This almost always happens when we are designing an entire subsystem and know its complete feature set for the market, but there is an advantage to shipping a subset of the features first. When I wrote X-Plane's particle system, we predicted (correctly) that people would want to use the effects on scenery and aircraft, but we shipped aircraft first with a design that wouldn't need rewrites to work with scenery.<br />
<br />
Given how strong the business case is for scenery effects, and given that we basically already know how this code would be implemented on its own (almost exactly the same as the aircraft code, but over here), it's very cheap to craft the aircraft code with this in mind.<br />
<br />
But it requires a lot of really solid information about the future to make this a win.</div>
</div>
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-30700623937490710152021-06-21T15:48:00.002-04:002021-06-21T21:28:10.412-04:00A Coroutine Can Be An Awaitable<blockquote><p>This is part five of a five part series on <a href="https://hacksoflife.blogspot.com/search/label/coroutines">coroutines</a> in C++.</p><p></p><ol><li><a href="https://hacksoflife.blogspot.com/2021/06/c-coroutines-getting-past-names.html">Getting Past the Names</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/coroutines-look-like-factories.html">Coroutines Look Like Factories</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/coawait-is-then-of-coroutines.html">co_await is the .then of Coroutines</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/we-never-needed-stackfull-coroutines.html">We Never Needed Stackful Coroutines</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/a-coroutine-can-be-awaitable.html">A Coroutine Can Be An Awaitable</a></li></ol></blockquote><p>The relationship between awaitables and coroutines is a little bit complicated.</p><p>First, an awaitable is anything that a coroutine can "run after". So an awaitable could be an object like a mutex or an IO subsystem or an object representing an operation (I "awaited" my download object and ran later once the download was done).</p><p>I suppose you could say that an awaitable is an "event source" where the event is the wait being over. Or you could say events can be awaitables and it's a good fit.</p><p>A coroutine is something that at least partly "happens later" - it's the thing that does the awaiting. (You need a coroutine to do the awaiting because you need something that is <i>code</i> to execute. The awaitable might be an object, e.g. a noun.)</p><p>Where things get interesting is that coroutines <b>can</b> be awaitables, because coroutines (at least ones that don't infinite loop) have endings, and when they end, that's something that you can "run after". The event is "the coroutine is over."</p><p>To make your coroutine awaitable you need to do two things:</p><p></p><ol style="text-align: left;"><li>Give it a co_await operator so the compiler understands how to build an awaitable object that talks to it and</li><li>Come up with a reason to wake the caller back up again later.</li></ol><div>Lewis Baker's <a href="https://github.com/lewissbaker/cppcoro">cppcoro</a> <a href="https://github.com/lewissbaker/cppcoro/blob/master/include/cppcoro/task.hpp">task</a> is a near-canonical version of this. (A rough sketch of the same idea appears below.)</div><div><ul style="text-align: left;"><li>The tasks start suspended, so they run when you co_await them.</li><li>They use their final_suspend object to resume the coroutine that awaited them to start them off.</li></ul><div>Thus tasks are awaitables, and they are able to await (because they are coroutines) and they can be composed indefinitely.</div></div><div><br /></div><div>While any coroutine <i>can</i> be an awaitable, they might not be. I built a "fire and forget" async coroutine that cleans itself up when it runs off the end - it's meant to be a top level coroutine that can be run from synchronous code, thus it doesn't try to use coroutine tech to signal back to its caller.</div>
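<div><br /></div><div>For contrast with the fire-and-forget case, here is a rough sketch of the two steps above - an awaitable, void-returning task. This is my own stripped-down illustration of the pattern (assuming the C++20 coroutine header), not cppcoro's actual code, and it omits result storage, frame cleanup, and error handling:</div><div><pre>#include <coroutine>

struct task {
    struct promise_type {
        std::coroutine_handle<> waiter;   // the coroutine that co_awaited us

        task get_return_object() {
            return task{ std::coroutine_handle<promise_type>::from_promise(*this) };
        }
        std::suspend_always initial_suspend() noexcept { return {}; }  // start suspended

        // Step 2: when the task finishes, wake the awaiting coroutine back up.
        struct final_awaiter {
            bool await_ready() noexcept { return false; }
            std::coroutine_handle<> await_suspend(std::coroutine_handle<promise_type> h) noexcept {
                return h.promise().waiter;   // transfer control back to the awaiter
            }
            void await_resume() noexcept {}
        };
        final_awaiter final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };

    std::coroutine_handle<promise_type> handle;

    // Step 1: a co_await operator, so "co_await some_task" stashes the caller
    // and then runs the task.
    auto operator co_await() {
        struct awaiter {
            std::coroutine_handle<promise_type> h;
            bool await_ready() noexcept { return false; }
            std::coroutine_handle<> await_suspend(std::coroutine_handle<> caller) noexcept {
                h.promise().waiter = caller;  // remember who to resume at final_suspend
                return h;                     // start the task now
            }
            void await_resume() noexcept {}
        };
        return awaiter{ handle };
    }
};</pre></div><div>A coroutine declared to return this type starts suspended; co_await-ing it from another coroutine records the caller's handle, runs the task, and resumes the caller when the task hits final_suspend - the same lifecycle cppcoro's task implements with far more care.</div>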
<div>With the fire-and-forget version, the actual C++ in the coroutine needs to decide how to register its final results with the host app, maybe by calling some other non-coroutine execution mechanism.</div><p></p>Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-20780543347055010322021-06-20T15:00:00.002-04:002021-06-21T21:27:56.131-04:00We Never Needed Stackful Coroutines<blockquote><p>This is part four of a five part series on <a href="https://hacksoflife.blogspot.com/search/label/coroutines">coroutines</a> in C++.</p><p></p><ol><li><a href="https://hacksoflife.blogspot.com/2021/06/c-coroutines-getting-past-names.html">Getting Past the Names</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/coroutines-look-like-factories.html">Coroutines Look Like Factories</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/coawait-is-then-of-coroutines.html">co_await is the .then of Coroutines</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/we-never-needed-stackfull-coroutines.html">We Never Needed Stackful Coroutines</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/a-coroutine-can-be-awaitable.html">A Coroutine Can Be An Awaitable</a></li></ol></blockquote><p>You've probably seen libraries that provide a "cooperative threading" API, i.e. threads of execution that yield to each other at specific times, rather than when the kernel decides to. Windows <a href="https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-switchtofiber">Fibers</a>, <a href="https://www.boost.org/doc/libs/1_76_0/libs/fiber/doc/html/fiber/overview.html">Boost Fibers</a>, <a href="https://linux.die.net/man/3/swapcontext">ucontext</a> on Linux. I've been programming for a very long time, so I have written code for MacOS's cooperative "<a href="http://mirror.informatimago.com/next/developer.apple.com/documentation/macos8/pdf/ThreadManager.pdf">Thread Manager</a>", back before cooperative multi-tasking was cool - at the time we were really annoyed that there was no pre-emption and that one bad worker thread would hang your whole machine. We were truly savages back then.</p><p>The gag for all of these "fiber" systems is the same - you have these 'threads' that only yield the CPU to another one via a yield API call - the yield call may name its successor or just pick any runnable non-running thread. Each thread has its own stack, thus our function (and its callstack and all state) is suspended mid-execution at the yield call.</p><p>Sounds a bit like coroutines, right?</p><p>There are two flavors of coroutines - stackful, and stackless. Fibers are stackful coroutines - the whole stack is saved to suspend the routine, allowing us to suspend at any point in execution anywhere.</p><p>C++20 coroutines are stackless, which means you can only suspend inside a coroutine itself - suspension depends on the compiler transforms of the coroutine (into a state machine object) to achieve the suspend.</p><p>When I first heard this, I thought: "well, performance is good but I wonder if I'll miss having full stack saves."</p><p>As it turns out, we don't need stack saves - here's why.</p><p>First, a coroutine is a function that has at least one potential suspend point.
If there are zero potential suspend points, it's just a function, like from the 70s with the weed and the K&R syntax, and to state the obvious, we don't have to worry about a plain old function trying to suspend on us.</p><p>So if a coroutine calls a plain old function, the function can't suspend, so not being able to save the whole stack isn't important. We only need to ask: "what happens if a coroutine calls a coroutine, and the child coroutine suspends?"</p><p>Well, we don't really "call" a child coroutine - we co_await it! And co_awaiting Just Works™.</p><p></p><ul style="text-align: left;"><li>When our parent coroutine co_awaits the child coroutine, the child coroutine (via the awaitable mediating this transfer of execution) gets a handle to the parent coroutine to stash for later. We can think of this as the "return address" from a stack function call, and the child coroutine can stash it somewhere in its promise storage.</li><li>When the child coroutine co-awaits (e.g. on I/O) this works as expected - we don't return to the parent coroutine because it's already suspended.</li><li>When the child coroutine finishes for real, its final suspend awaitable can return the parent coroutine handle (that we stashed) to transfer control back to the parent - this is the equivalent of using a RET opcode to pop the stack and jump to the return address.</li></ul><div>In other words, our chain of co-routines forms a synthetic stack-like saving structure - the coroutine frames form a linked list (from children to parents, via the stashed handle to the parent coroutine) and each frame stores state for that function.</div><div><br /></div><p></p><p><br /></p><p><br /></p>Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-59267894124636762782021-06-19T21:00:00.002-04:002021-06-21T21:27:39.357-04:00co_await is the .then of Coroutines<blockquote><p>This is part three of a five part series on <a href="https://hacksoflife.blogspot.com/search/label/coroutines">coroutines</a> in C++.</p><p></p><ol><li><a href="https://hacksoflife.blogspot.com/2021/06/c-coroutines-getting-past-names.html">Getting Past the Names</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/coroutines-look-like-factories.html">Coroutines Look Like Factories</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/coawait-is-then-of-coroutines.html">co_await is the .then of Coroutines</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/we-never-needed-stackfull-coroutines.html">We Never Needed Stackful Coroutines</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/a-coroutine-can-be-awaitable.html">A Coroutine Can Be An Awaitable</a></li></ol></blockquote><p>One more intuition about coroutines: the co_await operator is to coroutines as .then (or some other continuation queueing routine) is to a callback/continuation/closure system.</p><p>In a continuation-based system, the fundamental thing we want to express is "happens after". We don't care when our code runs as long as it is after the stuff that must run before completes, and we'd rather not block or spin a thread to ensure that.
Hence:</p><p><span style="font-family: courier;"></span></p><blockquote><p><span style="font-family: courier;">void dump_file(string path)</span></p><p><span style="font-family: courier;">{</span></p><p><span style="font-family: courier;"><span> future<string> my_data = file.async_load("~/stuff.txt");</span><br /></span></p><p><span><span style="font-family: courier;"><span> my_data.then([my_data](){</span><br /></span></span></p><p><span><span><span style="font-family: courier;"><span> </span><span> printf("File contained: %s\n", my_data.get().c_str());</span><br /></span></span></span></p><p><span><span><span><span style="font-family: courier;"><span> });</span><br /></span></span></span></span></p><p><span><span><span><span><span style="font-family: courier;">}</span></span></span></span></span></p></blockquote><p><span><span><span><span><span style="font-family: courier;"></span></span></span></span></span></p><p><span><span><span><span>The idea is that our file dumper will begin the async loading process and immediately return to the caller once we're IO bound. At some time later on some mystery thread (that's a separate rant) the IO will be done and our future will contain real data. At that point, our lambda runs and can use the data.</span></span></span></span></p><p><span><span>The ".then" API ensures that the file printing (which requires the data) <i>happens after</i> the file I/O.</span></span></p><p><span><span>With coroutines, co_await provides the same service.</span></span></p><p><span><span><span style="font-family: courier;"></span></span></span></p><blockquote><p><span><span><span style="font-family: courier;">async dump_file(string path)</span></span></span></p><p><span><span><span style="font-family: courier;">{</span></span></span></p><p><span><span><span style="font-family: courier;"><span> string my_data = co_await file.async_load("~/stuff.txt");</span><br /></span></span></span></p><p><span><span><span><span style="font-family: courier;"><span> printf("File contained: %s\n", my_data.c_str());</span><br /></span></span></span></span></p><p><span><span><span><span><span style="font-family: courier;">}</span></span></span></span></span></p></blockquote><p><span><span><span><span><span style="font-family: courier;"></span></span></span></span></span></p><p><span><span><span><span>Just like before, the beginning of dump_file runs on the caller's thread and runs the code to begin the file load. Once we are I/O bound, we exit all the way back to the caller; some time later the rest of our coroutine will run (having access to the data) on a thread provided by the IO system.</span></span></span></span></p><p><span><span><span><span>Once we realize that co_await is the .then of coroutines, we can see that anything in our system with an async continuation callback could be an awaitable. 
A few possibilities:</span></span></span></span></p><p></p><ul style="text-align: left;"><li><span><span><span><span>APIs that move execution to different thread pools ("run on X thread")</span></span></span></span></li><li><span><span><span><span>Non-blocking I/O APIs that call a continuation once I/O is complete.</span></span></span></span></li><li><span><span><span><span>Objects that load their internals asynchronously and run a continuation once they are fully loaded.</span></span></span></span></li><li><span><span><span><span>Serial queues that guarantee only one continuation runs at a time to provide non-blocking async "mutex"-like behavior.</span></span></span></span></li></ul><div>Given an API that takes a continuation, we can transform it into an awaitable by wrapping the API in an awaitable with a continuation that "kicks" the awaitable to resume the suspended coroutine.</div><div><br /></div><div>Awaitables are also one of two places where you get to find out <i>who</i> any given coroutine is - await suspend gives you the coroutine handle of the client coroutine awaiting on you, which means you can save it and put it in a FIFO or anywhere else for resumption later. You also get to return a coroutine handle to switch to any other code execution path you want.</div><div><br /></div><div>For example, a common pattern for a coroutine that computes something is:</div><div><ol style="text-align: left;"><li>Require the client to co_await the coroutine to begin execution - the awaitable that does this saves the client's coroutine handle.</li><li>Use a final_suspend awaitable to return to the client whose coroutine we stashed.</li></ol><div>(This is what cppcoro's task does.)</div></div><div><br /></div><div>The other place we can get a coroutine handle is in the constructor of our promise, which can use from_promise to find the underlying coroutine it is attached to, and pass it to the return type, allowing handles to coroutines to connect to their coroutines.</div><p></p>Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-45059504769476437682021-06-18T09:00:00.002-04:002021-06-21T21:27:20.668-04:00Coroutines Look Like Factories<blockquote><p>This is part two of a five part series on <a href="https://hacksoflife.blogspot.com/search/label/coroutines">coroutines</a> in C++.</p><p></p><ol><li><a href="https://hacksoflife.blogspot.com/2021/06/c-coroutines-getting-past-names.html">Getting Past the Names</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/coroutines-look-like-factories.html">Coroutines Look Like Factories</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/coawait-is-then-of-coroutines.html">co_await is the .then of Coroutines</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/we-never-needed-stackfull-coroutines.html">We Never Needed Stackful Coroutines</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/a-coroutine-can-be-awaitable.html">A Coroutine Can Be An Awaitable</a></li></ol></blockquote><p>In my previous post I complained about the naming of the various parts of coroutines - the language design is great, but I find myself having to squint at the parts sometimes.</p><p>Before proceeding, a basic insight about how coroutines work in C++.</p><p>You write a coroutine like a function (or method or lambda), but the world uses it like a factory that returns a handle to your newly created coroutine.</p><p>One of the simplest coroutine types I was 
able to write was an immediately-running, immediately-returning coroutine that returns nothing to the caller - something like this:</p><p><span style="font-family: courier;"></span></p><blockquote><p><span style="font-family: courier;">class async {</span></p><p><span style="font-family: courier;">public:</span></p><p><span style="font-family: courier;"><span> struct control_block {</span><br /></span></p><p><span><span style="font-family: courier;"><span> </span><span> auto initial_suspend() { return suspend_never(); }</span><br /></span></span></p><p><span><span><span style="font-family: courier;"><span> <span> auto final_suspend() { return suspend_never(); }</span></span><br /></span></span></span></p><p><span><span><span><span><span style="font-family: courier;"><span> </span><span> void return_void() {}</span><br /></span></span></span></span></span></p><p><span><span><span><span><span><span style="font-family: courier;"><span> </span><span> void unhandled_exception() { }</span><br /></span></span></span></span></span></span></p><p><span><span><span><span><span><span><span style="font-family: courier;"><span> };</span><br /></span></span></span></span></span></span></span></p><p><span><span><span><span><span><span><span><span style="font-family: courier;"><span> using promise_type = control_block;</span><br /></span></span></span></span></span></span></span></span></p><p><span><span><span><span><span><span><span><span><span style="font-family: courier;">};</span></span></span></span></span></span></span></span></span></p></blockquote><p><span><span><span><span><span><span><span><span><span style="font-family: courier;"></span></span></span></span></span></span></span></span></span></p><p>The return type "async" returns a really useless handle to the client. It's useless because the coroutine starts on its own and ends on its own - it's "fire and forget". The idea is to let you do stuff like this:</p><p><span style="font-family: courier;"></span></p><blockquote><p><span style="font-family: courier;">async fetch_file(string url, string path)</span></p><p><span style="font-family: courier;">{</span></p><p><span><span style="font-family: courier;"> string raw_data = co_await http::download_url(url);</span></span></p><p><span><span style="font-family: courier;"> co_await disk::write_file_to_path(path, raw_data);</span></span></p><p><span style="font-family: courier;">}</span></p></blockquote><p><span style="font-family: courier;"></span></p><p>In this example, our coroutine suspends on IO twice, first to get data from the internet, then to write it to a disk, and then it's done. Client code can do this:</p><p><span style="font-family: courier;">void get_files(vector<pair<string,string>> stuff)</span></p><p><span style="font-family: courier;">{</span></p><p><span><span style="font-family: courier;"> for(auto spec : stuff)</span></span></p><p><span><span style="font-family: courier;"> {</span></span></p><p><span style="font-family: courier;"><span><span> </span><span> fetch_file(spec.first,spec.second);</span><br /></span><span> }</span><br />}</span></p><p>To the client code, fetch_file is a "factory" that will create one coroutine for each file we want to get; that coroutine will start executing using the caller for get_files, do enough work to start downloading, and then return. We'll queue a bunch of network ops in a row.</p><p>How does the coroutine finish? The IO systems will resume our coroutine once data is provided. What thread is executing this code? 
I have no idea - that's up to the IO system's design. But it will happen after "fetch_file" is done.</p><p>Is this code terrible? So first, yes - I would say an API to do an operation with no way to see what happened is bad. </p><p>But if legacy code is callback based, this pattern can be quite useful - code that launches callbacks typically puts the finalization of its operation <i>in the callback</i> and does nothing after launching the callback - the function is fire and forget because the end of the coroutine or callback handles the results of the operation.</p>Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-9633370748178293902021-06-17T11:12:00.003-04:002021-06-21T21:25:54.151-04:00C++ Coroutines - Getting Past the Names<p></p><blockquote><p>This is part one of a five part series on <a href="https://hacksoflife.blogspot.com/search/label/coroutines">coroutines</a> in C++.</p><p></p><ol style="text-align: left;"><li><a href="https://hacksoflife.blogspot.com/2021/06/c-coroutines-getting-past-names.html">Getting Past the Names</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/coroutines-look-like-factories.html">Coroutines Look Like Factories</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/coawait-is-then-of-coroutines.html">co_await is the .then of Coroutines</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/we-never-needed-stackfull-coroutines.html">We Never Needed Stackful Coroutines</a></li><li><a href="https://hacksoflife.blogspot.com/2021/06/a-coroutine-can-be-awaitable.html">A Coroutine Can Be An Awaitable</a></li></ol></blockquote><p>I really like the C++20 coroutine language features - coroutines are a great way to write asynchronous code, and the language facilities give us a lot of power and client code that should be very manageable to write.</p><p>My one gripe is with the naming of a few of the various parts of the coroutine control structures.
This shouldn't matter that much because only library writers have to care, but right now we're in the library wild west (which is fine by me - I'm old enough to be very skeptical of "the standard library should have everything" and happy to roll my own) so we can't avoid them.</p><p>A coroutine's overall behavior is mediated by its declared return type, which will typically be a struct or class from library code that looks something like this:</p><p></p><blockquote><p><span style="font-family: courier;">template<typename T></span></p><p><span style="font-family: courier;">struct task {</span></p><p><span style="font-family: courier;"><span> struct task_promise_thingie {</span><br /></span></p><p><span><span style="font-family: courier;"><span> </span><span>auto initial_suspend() { return std::suspend_always(); }</span><br /></span></span></p><p><span><span><span style="font-family: courier;"><span> <span>...</span></span><br /></span></span></span></p><p><span><span><span><span><span style="font-family: courier;"><span> };</span><br /></span></span></span></span></span></p><p><span><span><span><span><span><span style="font-family: courier;"><span> using promise_type = task_promise_thingie;</span><br /></span></span></span></span></span></span></p><p><span><span><span><span><span><span><span style="font-family: courier;">};</span></span></span></span></span></span></span></p></blockquote><p><span><span><span><span><span><span><span style="font-family: courier;"></span></span></span></span></span></span></span></p><p><span><span><span><span><span><span>The inner struct (it could be separate but the nested structure expresses the relationship to client code in a reasonable way I think) is called the "promise type" and ... man, do I hate that name. Partly because I consider the whole promise-future async model from C++11 to be bad, and partly because the "promise type" does a lot of things, of which the promise might be the least important.</span></span></span></span></span></span></p><p><span><span><span><span><span><span>I'm not sure what I would have called the promise type, but I'd lean toward "traits" or "control block", because the promise type does a few key things for you:</span></span></span></span></span></span></p><p></p><ol style="text-align: left;"><li>It defines the beginning and end of the lifecycle of your co-routine. Does the coroutine immediately start when created, or does it immediately suspend until someone kicks it? Does the coroutine run off the end and die on its own or wait until someone kills it off?</li><li>It defines the flow control on the CPU at the beginning and end of the co-routine. Because the beginning and end of a coroutine's life are controlled by awaitables, you can launch an arbitrary coroutine at either point! For example, if the coroutine delivers its output to another coroutine, you can, at final suspend, resume that parent co-routine on whatever thread you ended on.</li><li>It defines the "handle" type that client code will see when creating the coroutine.</li></ol><div>This brings me to my second rant - the outer handle that client code gets when "calling" the coroutine (which is really constructing it) is called the "return type". This is harder to yell about because it is...a return type...but once again, I think this isn't the most important thing.</div><div><br /></div><div>I would say the most important thing about the coroutine's return type is that it's the only access to the coroutine that client code gets.
So if the client code that calls/builds the coroutine is going to have any further interaction with it, the return type is the "handle". Left to my own byzantine naming instincts, I might have called it a "client handle" or something.</div><p></p>Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-17619695717210944812021-02-27T15:27:00.004-05:002021-02-27T15:27:20.644-05:00Making SoA Tolerable<p>Chandler Carruth (I think - I can't for the life of me find the reference) said something in a CppCon talk years ago that blew my mind. More or less, 95% of code performance comes from the memory layout and memory access patterns of data structures, and 5% comes from clever instruction selection and instruction stream optimization.</p><p>That is...terrible news! Instruction selection is now pretty much entirely automated. LLVM goes into my code and goes "ha ha ha foolish human with your integer divide by a constant, clearly you can multiply by this random bit sequence that was proven to be equivalent by a mathematician in the 80s" and my code gets faster. There's not much I have to worry about on this front.</p><p>The data structures story is so much worse. I say "I'd like to put these bytes here" and the compiler says "very good sir" in sort of a deferential English butler kind of way. I can sense that maybe there's some judgment and I've made bad life choices, but the compiler is <i>just going to do what I told it. "</i>Lobster Thermidor encrusted in Cool Ranch Doritos, very good sir" and Alfred walks off to leave me in a hell of L2 cache misses of my own design that turn my i5 into a 486.</p><p>I view this as a fundamental design limitation of C++, one that might someday be fixed with generative meta-programming (that is, when we can program C++ to write our C++, we can program it to take our crappy OOPy-goopy data structures and reorganize them into something the cache likes) but that is the Glorious Future™. For now, the rest of this post is about what we can do about it with today's C++.</p><h3 style="text-align: left;">There Is Only Vector</h3><p>To go faster, we have to keep the CPU busy, which means not waiting for memory. The first step is to use vector and stop using everything else - see the <a href="https://www.youtube.com/watch?v=fHNmRkzxHWs">second half of Chandler's</a> talk. Basically any data structure where the next thing we need isn't directly after the thing we just used is bad because the memory might not be in cache.</p><p>We experienced this first hand in X-Plane during the port to Vulkan. Once we moved from OpenGL to Vulkan, our CPU time in driver code went way down - 10x less driver time - and all of the remaining CPU time was in our own code. The clear culprit was the culling code, which walks a hierarchical bounding volume tree to decide what to draw.</p><p>I felt <i>very clever</i> when I wrote that bounding volume tree in 2005. It has great big-O properties and lets us discard a lot of data very efficiently. So much winning!</p><p>But also, it's a tree. The nodes are almost never consecutive, and a VTune profile is just a sea of cache misses each time we jump nodes. It's slow because it runs at the speed of main memory.</p><p>We replaced it with a structure that would probably cause you to fail CS 102, algorithms and data structures:</p><p>1. A bunch of data is kept in an array for a sub-section of the scenery region.</p><p>2. The sub-sections are in an array.</p><p>And that's it.
It's a tree of fixed design of depth two and a virtually infinite node count.</p><p>And it screams. It's absurdly faster than the tree it replaces, because pretty much every time we have to iterate to our next thing, it's right there, in cache. The CPU is good at understanding arrays and is going to get the next cache line while we work. Glorious!</p><p>There are problems so big that you still need big-O analysis, non-linear run-times, etc. If you're like me and have been doing this for a long time, the mental adjustment is how big N has to be to make that switch. If N is 100, that's not a big number anymore - put it in an array and blast through it.</p><h3 style="text-align: left;">We Have To Go Deeper</h3><p>So far all we've done is replaced every STL container with vector. This is something that's easy to do for new code, so I would say it should be a style decision - default to vector and don't pick up sets/maps/lists/whatever unless you have a really, really, really good reason.</p><p>But it turns out vector's not that great either. It lines up our objects in a row, but it works on <i>whole objects</i>. If we have an object with a lot of data, some of which we touch all of the time and some of which we use once on leap years, we waste cache space on the rarely used data. Putting whole objects into an array makes our caches smaller, by filling them up with stuff we aren't going to use because it happens to be nearby.</p><p>Game developers are very familiar with what to do about it - perhaps less so in the C++ community: vector gives us an <b>array of structures</b> - each object is consecutive and then we get to the next object; what we really want is a <b>structure of arrays - </b>each <i>member</i> is stored consecutively for every object, and then we hit the next member.</p><p>Imagine we have a shape object with a location, a color, a type, and a label. In the structure of arrays world, we store 4 shapes by storing: [(location1, location2, location3, location4), (color1, color2, color3, color4), (type1, type2, type3, type4), (label1, label2, label3, label4)].</p><p>First, let's note how much better this is for the cache. When we go looking to see if a shape is on screen, all locations are packed together; every time we skip a shape, the next shape's location is next in memory. We have wasted no cache or memory bandwidth on things we won't draw. If label drawing is turned off, we can ignore that entire block of memory. So much winning!</p><p>Second, let's note how absolutely miserable this is to maintain in C++. Approximately 100% of our tools for dealing with objects and encapsulations go out the window because we have taken our carefully encapsulated objects, cut out their gooey interiors and spread them all over the place. If you showed this code to an OOP guru they'd tell you you've lost your marbles. (Of course, SoA isn't object oriented design, it's data oriented design. The objects have been minced <i>on purpose</i>!)</p><h3 style="text-align: left;">Can We Make This Manageable?</h3><p>So the problem I have been thinking about for a while now is: how do we minimize the maintenance pain of structures of arrays when we have to use them?
X-Plane's user interface isn't so performance critical that I need to take my polymorphic hierarchy of UI widgets and cut it to bits, but the rendering engine has a bunch of places where moving to SoA is <i>the</i> optimization to improve performance.</p><p>The least bad C++ I have come up with so far looks something like this:</p><p class="p1" style="background-color: white; font-family: Menlo; font-size: 13px; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><span class="s1" style="color: blue;">struct</span><span class="Apple-tab-span" style="white-space: pre;"> </span>scenery_thingie {</p><p class="p1" style="background-color: white; font-family: Menlo; font-size: 13px; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><span> </span>int<span> </span><span> </span><span> </span>count;</p><p class="p1" style="background-color: white; font-family: Menlo; font-size: 13px; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><span class="s2"><span> </span>float *<span> <span> </span> </span>cull_x;<span class="Apple-tab-span" style="white-space: pre;"> </span></span></p><p class="p1" style="background-color: white; font-family: Menlo; font-size: 13px; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><span> </span>float *<span> </span><span> </span>cull_y;</p><p class="p1" style="background-color: white; font-family: Menlo; font-size: 13px; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><span> </span>float *<span> </span><span> </span>cull_z;</p><p class="p1" style="background-color: white; font-family: Menlo; font-size: 13px; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><span> </span>float *<span> </span><span> </span>cull_radius;</p><p class="p1" style="background-color: white; font-family: Menlo; font-size: 13px; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><span><span> </span>gfx_mesh *<span> mesh_handle;</span></span><br /></p><p class="p1" style="background-color: white; font-family: Menlo; font-size: 13px; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><span><span><br /></span></span></p><p class="p1" style="background-color: white; font-family: Menlo; font-size: 13px; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><span><span><span> void alloc(UTL_block_alloc * alloc, int count);</span><br /></span></span></p><p class="p1" style="background-color: white; font-family: Menlo; font-size: 13px; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><span><span><span> scenery_thingie& operator++();</span></span></span></p><p class="p1" style="background-color: white; font-family: Menlo; font-size: 13px; font-stretch: normal; font-variant-east-asian: normal; font-variant-numeric: normal; line-height: normal; margin: 0px;"><span><span><span> </span></span></span>scenery_thingie& operator+=(int offset);</p><p class="p1" style="background-color: white; font-family: Menlo; font-size: 13px; 
<p>You can almost squint at this and say "this is an object with five fields", and you can almost squint at this and say "this is an array" - it's both! The trick is that each member field is a base pointer to the first object's value for that member, with the values for the remaining objects coming consecutively. While all of the cull_y values don't have to follow the cull_x values in memory, it's nice if they do - we'd rather not have them on different VM pages, for example.</p><p>Our SoA struct can be an array (in that it owns the memory and has the base pointers), but it can also be an iterator - the increment operator increments each of the base pointers. In fact, we can easily build a sub-array by increasing the base pointers and cutting the count, and iteration is just slicing off smaller sub-arrays in place - it's very cheap.</p><p>This turns out to be pretty manageable! We end up writing *iter.cull_x instead of iter->cull_x, but we more or less get to work with our data as expected.</p>
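<p>To make the "array that is also an iterator" idea concrete, a culling loop might look something like this sketch. The camera type, sphere_visible and draw_mesh are made-up stand-ins, and I'm assuming operator++ both advances the base pointers and trims the count, per the slicing description above:</p>
<pre>// Hypothetical traversal - "things" is both the array and the cursor into it.
void draw_visible(scenery_thingie things, const camera& cam)
{
	while (things.count > 0)
	{
		if (cam.sphere_visible(*things.cull_x, *things.cull_y,
		                       *things.cull_z, *things.cull_radius))
			draw_mesh(*things.mesh_handle);
		++things;	// slice the front off of the sub-array in place
	}
}</pre>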
<h3 style="text-align: left;">Where Did the Memory Come From?</h3><p>We have one problem left: where did the memory come from to allocate our SoA? We need a helper - something that will "organize" our dynamic memory request and set up our base pointers to the right locations. This code is doing what operator new[] would have done.</p>
<pre>class UTL_block_alloc {
public:

	UTL_block_alloc();

	template<typename T>
	inline void alloc(T ** dest_ptr, size_t num_elements);

	void * detach();
};</pre>
<p>Our allocation block helper takes a bunch of requests for arrays of T's (e.g. arbitrary types) and allocates one big block that holds them consecutively, filling in dest_ptr to point to each one. When we call detach, the single giant malloc() block is returned to be freed by client code.</p>
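<p>I won't show our real implementation, but one plausible way to build a helper like this is to record each request and defer the single malloc() to detach(), patching the recorded pointers once the block exists. A sketch of that idea (a guess at the internals, not the actual class):</p>
<pre>#include <cstddef>
#include <cstdlib>
#include <vector>

class UTL_block_alloc {
public:
	// Remember where the caller wants the pointer written and how much space it
	// needs. The caller's pointer must stay at this address until detach() runs.
	template<typename T>
	inline void alloc(T ** dest_ptr, size_t num_elements)
	{
		m_total = align_up(m_total, alignof(T));
		m_requests.push_back(request{ (void **) dest_ptr, m_total });
		m_total += num_elements * sizeof(T);
	}

	// One malloc for everything, then aim every recorded pointer into the block.
	void * detach()
	{
		char * block = (char *) malloc(m_total);
		for (const request& r : m_requests)
			*r.dest = block + r.offset;
		m_requests.clear();
		m_total = 0;
		return block;	// client frees this one block when the SoA dies
	}
private:
	struct request { void ** dest; size_t offset; };
	static size_t align_up(size_t v, size_t a) { return (v + a - 1) & ~(a - 1); }

	std::vector<request> m_requests;
	size_t               m_total = 0;
};</pre>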
<p>We can feed any number of SoA arrays via a single alloc block, letting us pack an entire structure of arrays of structures into one consecutive memory region. With this tool, "alloc" of an SoA is pretty easy to write.</p>
<pre>void scenery_thingie::alloc(UTL_block_alloc * a, int in_count)
{
	count = in_count;
	a->alloc(&cull_x, in_count);
	a->alloc(&cull_y, in_count);
	a->alloc(&cull_z, in_count);
	a->alloc(&cull_radius, in_count);
	a->alloc(&mesh_handle, in_count);
}</pre>
<p>A few things to note here:</p><ul style="text-align: left;"><li>The allocation helper is taking the sting out of memory layout by doing it dynamically at run-time. This is probably fine - the cost of the pointer math is trivial compared to actually going and getting memory from the OS.</li><li>When we iterate, we are using memory to find our data members. While there exists some math to find a given member at a given index, we are storing one pointer per member in the iterator instead of one pointer total.</li></ul><div>One of these structs could be turned into something that looks more like a value type by owning its own memory, etc. but in our applications I have found that several SoAs tend to get grouped together into a bigger 'system', and letting the system own a single block is best. Since we have already opened the Pandora's box of manually managing our memory, we might as well group things completely and cut down on allocator calls while getting better locality.</div><h3 style="text-align: left;">Someday We'll Have This</h3><p>Someday we'll have meta-programming, and when we do, it would be amazing to make a "soa_vector" that, given a POD data type, generates something like this:</p>
<pre>struct scenery_thingie {
	int    count;
	int    stride;
	char * base_ptr;

	float& cull_x() { return *(float *) base_ptr; }
	float& cull_y() { return *(float *) (base_ptr + 4 * stride); }
	float& cull_z() { return *(float *) (base_ptr + 8 * stride); }
	/* ... */
};</pre>
<div><br /></div><div>I haven't pursued this in our code because of the annoyance of having to write and maintain the offset-fetch macros by hand, as well as the obfuscation of what the intended data layout really is. I am sure this is possible now with TMP, but the cure would be worse than the disease. But generative meta-programming, I think, does promise this level of optimized implementation from relatively readable source code.</div><h3 style="text-align: left;">Nitty Gritty - When To Interleave</h3><p>One last note - in my example, I split the X, Y and Z coordinates of my culling volume into their own arrays. Is this a good idea? If it was a vec3 struct (with x,y,z members) what should we have done?</p><p>The answer is ... it depends? In our real code, X, Y and Z are separate for SIMD friendliness - a nice side effect of separating the coordinates is that we can load four <i>objects</i> into four lanes of a SIMD register and then perform the math for four objects at once. This is the biggest SIMD win we'll get - it is extremely cache efficient, we waste no time massaging the data into SIMD format, and we get 100% lane utilization. If you have a chance to go SIMD, separate the fields.</p><p>But this isn't necessarily best. If we had to make a calculation based on XYZ together, and we <i>always</i> use them together and we're not going to SIMD them, it might make sense to pack them together (e.g. so our data went XYZXYZXYZXYZ, etc.). This would mean fetching position would require only one stride in memory and not three. It's not bad to have things together in cache if we want them together in cache.</p><p><br /></p>Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com3tag:blogger.com,1999:blog-6042417775578107106.post-69596447054873813892020-10-21T22:01:00.001-04:002020-10-21T22:01:25.517-04:00A Tip for HiZ SSR - Parametric 't' Tracing<p>HiZ tracing for screen space reflections is an optimization where the search is done using a hierarchical Z-Buffer (typically stashed in a mip-chain) - the search can take larger skips when the low-res Z buffer indicates that there could not possibly be an occluder. 
This changes each trace from O(N) to O(logN).</p><p>The technique was published in <a href="https://www.amazon.com/GPU-Pro-Advanced-Rendering-Techniques/dp/1482208636">GPU Pro 5</a>, but as best I can tell, the author found out after writing the article that he couldn't post working sample code. The result is a tough chapter to make heads or tails of, because some parts of the algorithm simply say "see the sample code". This <a href="https://www.gamedev.net/forums/topic/658702-help-with-gpu-pro-5-hi-z-screen-space-reflections/">forum thread</a> is actually pretty useful, as is <a href="https://www.jpgrenier.org/ssr.html">this</a>.</p><p>The article builds a ray that is described parametrically in "Z" space, starting at the near clip plane (0.0 screen-space Z, must be a DX programmer ;-) and going out to 1.0.</p><p>If your app runs using reverse-float Z and you reverse this (starting the march at 1 and going in the -Z direction), you're going to get a ton of precision artifacts. The reason: our march has the lowest precision at the start of the march. In reverse-float Z there's a lot of hyper-Z (screen-space) depth "wasted" around the near clip plane - that's okay because it's the 'good' part of our float range, which gets better with distance. But in our case, it's going to make our ray testing a mess.</p><p>The technique is also presented tracing only near occluders and tracing only away from the camera - this is good if you view a far away mountain off a lake but not good if you view a building through a puddle from nearly straight down.</p><p>As it turns out, <i>all</i> of these problems can be addressed with one restructure: <i>parametrically</i> tracing the ray using a parametric variable "t" instead of the actual Z buffer.</p><p>In this modification, the beginning of the ray cast is always at t = 0.0, and the end of the ray-cast (a full Z unit away) is always at t = 1.0, regardless of whether that is in the positive or negative Z direction - the ray direction is normalized so that its Z component is either +1.0 or -1.0, which saves us a divide when intersecting with Z planes.</p><p>What does this solve? A few things:</p><p></p><ol style="text-align: left;"><li>The algorithm has high precision at the beginning of the trace because t has small magnitude - we want this for accurate tracing of close occluders and odd angles.</li><li>The algorithm is fully general whether the ray is cast toward or away from the camera - no conditional code inside the search loop. Lower 't' is always "sooner" in the ray cast.<br /></li><li>We can now march <i>around</i> an occluder if we can attribute a min and max depth to it, by seeing if the <i>range</i> of t within a search cell overlaps the <i>range</i> of t in our min-max Z buffer. This is fully general for "in front" and "behind" a cast and for "toward" and "away".</li></ol><div>I didn't see any mention of changing the parameterization in my web searching, nor did I see anyone else complaining about the precision loss from reverse-Z (it took me a good day or two to realize why my floating point was underperforming on that one) so I figured I'd put it out there.</div><p></p>Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-34665915312861638652020-06-18T16:11:00.000-04:002020-06-18T16:11:02.648-04:00Second. Worst. Lock. Ever.<div dir="ltr" style="text-align: left;" trbidi="on">
Four years ago I wrote a short post <a href="http://hacksoflife.blogspot.com/2016/07/worst-lock-ever.html">describing a dumb race condition</a> in our reference counted art assets.<br />
<br />
To recap the problem: X-Plane's art assets are immutable and reference counted, so they can be accessed lock-free (read-only) from rendering threads.<br />
<br />
But X-Plane also has a lookup table from file path to art asset that is functionally a table of weak references; on creation of an art asset we use the table to see whether we already loaded that art asset and, if so, bump its count. On destruction we have to grab the table lock to clear out the entry.<br />
<br />
So version one of the code, which was really really bad, looked like this:<br />
<br />
<pre>void object::release()
{
if(m_count.decrement() == 0)
{
// Race goes here
RAII_lock lock(m_table_lock());
m_table.erase(this->name);
delete this;
}
}
</pre>
This code is bad because after we decrement our reference count, but before we lock the table, another thread can go in, lock the table, find our art asset, increment its reference count and unlock the table - this would be caused by an async load of the same art asset (in another thread) hitting the "fast path". We then delete a non-zero-count object.<br />
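<br />
To see the race, it helps to picture the other side - the loader's "fast path" that can sneak in between our decrement and our table lock. A sketch (the function and helper names here are made up for illustration, not the actual loader; m_table and m_table_lock are the same table and lock used above):<br />
<pre>object * find_or_load(const std::string& path)
{
	RAII_lock lock(m_table_lock());
	auto i = m_table.find(path);
	if(i != m_table.end())
	{
		// Fast path: already loaded - bump the count and hand it out. If the count
		// just hit zero on another thread, this resurrects an object that thread is
		// about to delete.
		i->second->m_count.increment();
		return i->second;
	}
	return load_new_asset_locked(path);	// slow path, while still holding the table lock (made-up helper)
}
</pre>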
<br />
The fix at the time was this:<br />
<br />
<pre>void object::release()
{
RAII_lock lock(m_table_lock());
if(m_count.decrement() == 0)
{
m_table.erase(this->name);
delete this;
}
}
</pre>
<br />
Since the table is locked before the decrement, no one can get in and grab our object - we block out all other asset loaders before we decrement; if we hit zero reference count, we take out the sledge hammer and trash the object.<br />
<h3 style="text-align: left;">
Correct But Stupid</h3>
The problem with this new design is that it holds the table lock across <i>every</i> release operation - even ones where there is <i>no</i> chance of actually releasing the object.<br />
<br />
We hold the table lock during asset creation - the API contract for loaders is that you get back a valid* C++ object representing the art asset when the creation API returns, so this effectively means we have to hold the lock so that a second thread loading the same object can't return the partially constructed object being built by a first thread. This means the lock isn't a spin lock - it can be held across disk access for tens of milliseconds.<br />
<br />
Well, that's not good. What happens when you put your object into a C++ smart handle that retains and releases the reference count in the constructor/destructor?<br />
<br />
The answer is: you end up calling release all over the place and are constantly grabbing the table lock for one atomic op, and sometimes you're going to get stuck because someone else is doing real loading.<br />
<br />
The reason this is a total fail is: client code would not expect that simply <i>moving around</i> ownership of the reference would be a "slow" operation the way true allocation/deallocation is. If you say "I release an art asset on the main thread and the sim glitched" I tell you you're an idiot. If you say "my vector resized and I locked the sim for 100 ms", that's not a good API.<br />
<h3>
Third Time's a Charm</h3>
The heart of the bug is that we eat the expensive table lock when we release regardless of whether we need it. So here's take three:<br />
<br />
<pre>void object::release()
{
if(m_count.decrement() == 0)
{
RAII_lock lock(m_table_lock());
// If someone beat us to the table lock, check and abort.
if(m_count.load() > 0)
return;
m_table.erase(this->name);
delete this;
}
}
</pre>
<br />
This is sort of like double-checked locking: we do an early first check of the count to optimize out the table lock when it is obvious we aren't the deleter (our reference count is greater than zero after we decrement). Once we take the table lock, we then re-check that no one beat us into the table between the decrement and the lock, and if we are still okay, we delete.<br />
<br />
The win here is that we only take the table lock in the case where we are very likely to deallocate - and client code should only be hitting that case if the client code is prepared to deallocate a resource, which is never fast. With this design, as long as resource deallocation (at the client level) is in the background with resource creation, we never hit the table lock from any critical rendering path or incidental book-keeping.<br />
<br />
* With the Vulkan renderer we now have art assets that complete some of their loading asynchronously - this is more or less mandatory because DMA transfers to the GPU are always asynchronous. So the synchronous part of loading is establishing a C++ object functional enough to "do useful stuff with it."<br />
<br />
We could start to erode this time by having more functionality be asynchronously available and less be always-guaranteed. But in practice it's not a real problem because entire load operations on the sim itself are already in the background, so lowering load latency doesn't get us very much real win.</div>
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com2tag:blogger.com,1999:blog-6042417775578107106.post-8300880203763198582020-04-16T22:50:00.001-04:002020-04-16T22:50:43.700-04:00Specular Hilites Have Their Own Depth<div dir="ltr" style="text-align: left;" trbidi="on">
I just noticed this effect while I was brushing my teeth. The faucet in the sink is:<br />
<br />
<ul style="text-align: left;">
<li>Chrome (or stainless steel) or some other reflective "metal" and</li>
<li>Reasonably glossy and</li>
<li>Kinda dirty.</li>
</ul>
As I looked down, I could see specular reflections from the lights above the mirror and I could see calcium deposits on the surface of the faucet itself.<br />
<br />
And...then I noticed that they didn't have the same parallax. Close one eye, then the other, and the <i>relative</i> placement of the specular hilites with regards to the calcium deposits changes.<br />
<br />
In other words, the specular hilites are <i>farther away</i> than the surface they reflect in.<br />
<br />
If you take a step back and think about this, it's completely logical. An image of yourself in a mirror appears twice as far away as the mirror itself, and if you draw an optical diagram, this is not surprising. Twice as much eye-separation parallax is 'washed out' by the distance along the full optical path as along the path to the surface of the mirror.<br />
<br />
Interestingly, X-Plane's VR system correctly simulates this. We do per-pixel lighting calculations separately on each eye using a camera origin (and thus transform stack and light vector) specific to each eye. When I first coded this, I thought it was odd that the lighting could be "inconsistent" between eyes, but it never looked bad or wrong.<br />
<br />
Now I realize that that inconsistency is literally a depth cue.<br />
<br />
<br /></div>
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-56977038757230488842019-11-29T19:06:00.001-05:002019-11-29T19:06:44.784-05:00How to Make the OS X Terminal Change Colors for Remote Servers<div dir="ltr" style="text-align: left;" trbidi="on">
A thing I have done is attempt to hard shut down my Mac using the shell (e.g. sudo shutdown -r) and accidentally take down one of our servers that I was ssh'd into.<br />
<br />
To fix this and generally have more awareness of what the heck I am doing, I use a .profile script to change the Terminal theme when logging into and out of remote servers via SSH.<br />
<br />
<style type="text/css">
p.p1 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000}
p.p2 {margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Menlo; color: #000000; min-height: 13.0px}
span.s1 {font-variant-ligatures: no-common-ligatures}
</style>
<br />
<div class="p1">
<span class="s1">echo BashRC</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">function tabc {</span></div>
<div class="p1">
<span class="s1"><span class="Apple-converted-space"> </span>NAME=$1; if [ -z "$NAME" ]; then NAME="Default"; fi</span></div>
<div class="p1">
<span class="s1"><span class="Apple-converted-space"> </span>osascript -e "tell application \"Terminal\" to set current settings of front window to settings set \"$NAME\""</span></div>
<div class="p1">
<span class="s1">}</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
<div class="p1">
<span class="s1">function ssh {</span></div>
<div class="p1">
<span class="s1"><span class="Apple-converted-space"> </span>tabc "Pro"</span></div>
<div class="p1">
<span class="s1"><span class="Apple-converted-space"> </span>/usr/bin/ssh "$@"</span></div>
<div class="p1">
<span class="s1"><span class="Apple-converted-space"> </span>tabc "Basic"</span></div>
<div class="p1">
<span class="s1">}</span></div>
<div class="p2">
<span class="s1"></span><br /></div>
I am not sure where this comes from - possibly <a href="https://www.rngtng.com/2011/01/14/mac-os-x-terminal-visual-indication-for-your-ssh-connection/">here</a>, but I see a few web references and I don't know which one I originally found. I'm posting it here so that I can find it the next time I get a new machine.<br />
<div>
<br /></div>
<div>
(Redoing this was necessary because I didn't migration-assistant my new laptop, and since it was going to Catalina, that was probably for the best.)</div>
<div>
<br /></div>
<div>
The script goes into ~/.profile for OS X 10.14 and older, or ~/.zprofile for OS X 10.15.</div>
<div>
<br /></div>
</div>
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com1tag:blogger.com,1999:blog-6042417775578107106.post-16966585262329249632019-10-29T08:49:00.000-04:002019-10-29T08:49:37.072-04:00Klingons Do Not Release Software<div dir="ltr" style="text-align: left;" trbidi="on">
Brent Simmons on <a href="https://inessential.com/2019/10/28/no_etas">ETAs for software releases</a>:<br />
<blockquote class="tr_bq">
This is all just to say that app-making is nothing like building a house. It’s more like building the first house ever in the history of houses, with a pile of rusty nails and your bare hands, in a non-stop tornado.
</blockquote>
He put this at the end of the article, but I think it's the most important lesson here: it's hard to estimate how long a coding task will take because every coding task that you undertake that is worth something to your business is the very first time your business has made anything like that task.<br />
<br />
This comes from the fundamentally reusable nature of software. We see this with X-Plane. We have decent estimates for how long aircraft will take to build because each aircraft, while different from the others in terms of its aerodynamics, look, texturing, modeling, is the same in terms of the <i>kind</i> of tasks: 3-d modeling, UV mapping, animation, etc.<br />
<br />
On the software side, we do everything exactly once. Once we rewrite the entire rendering engine to run under Vulkan, we will never do that again, because we will have Vulkan support, and coding it twice would be silly.<br />
<br />
Almost every software feature ends up like this - the feature is fundamentally novel compared to the last one because you don't need a second copy of code.<br />
<br />
Brent finishes up the article with this:<br />
<blockquote class="tr_bq">
The only reason anything ever ships is because people just keep working until it’s ready.
</blockquote>
But this is only half-true. Code also ships when the scope of the project is reduced to the things you have already finished. And I think this is important to remember because reducing scope is the one thing that actually reduces ship times.<br />
<br />
This reminded me of the old nerd joke:<br />
<blockquote class="tr_bq">
Klingons do not release software. Klingon software escapes, leaving a bloody trail of design engineers and quality assurance testers in its wake.</blockquote>
On a big, unwieldy, messy release with seat-of-the-pants project management, that's a pretty good description of what happens - at some point the overwhelming business force of needing to ship dominates what's left in the todo pile and the software smashes through the wall, Kool-Aid style and gives the users a big "Oh Yeah!"<br />
<br />
I think Brent's definition is right, as long as we recognize that "ready" is a dynamic term that can change based on market conditions and the better understanding of what you really need to ship that you get from implementation. If your app figures out it's ready before you do, you might have Klingon software on your hands.</div>
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-1802232474414470112019-09-14T13:23:00.001-04:002019-09-14T13:23:54.929-04:00Hardware Performance: Old and New iPhones and PCs<div dir="ltr" style="text-align: left;" trbidi="on">
I am looking at performance on our latest revision of X-Plane Mobile, and I've noticed that the performance bottlenecks have changed from past versions.<div>
<br /></div>
<div>
In the past, we have been limited by some of everything - vertex count, CPU-side costs (mostly in the non-graphics code, which is ported from desktop), and fill rate/shading costs. This time around, something has changed - performance is pretty much about shading costs, and that's the whole ballgame.</div>
<div>
<br /></div>
<div>
This is a huge relief! Reducing shading costs is a problem for which real-time graphics has a lot of good solutions. It's normally <i>the</i> problem, so pretty much all rendering techniques find ways to address it.</div>
<div>
<br /></div>
<div>
How did this shift happen? I think part of it is that Apple's mobile CPUs have just gotten really, really good. The iPhone X has a single-core GeekBench 4 score of 4245. By comparison, the 2019 iMac with an i5-8500 at 3 GHz gets 5187. That's just not a huge gap between a full-size, mid-range gaming desktop computer and...a phone.</div>
<div>
<br /></div>
<div>
In particular, we've noticed that bottlenecks that used to show up only on mobile have more or less gone away. My take on this is that Apple's mobile chips can now cope with not-hand-tuned code as well as Intel's can, e.g. less than perfect memory access patterns, data dependencies, etc. (A few years ago, code that was good enough on desktop that hand tuning wouldn't be a big win would still need some TLC for mobile. It's a huge dev time win to have that not be the case anymore.)</div>
<div>
<br /></div>
<div>
If I can summarize performance advice for today's mobile devices in a single item, it would be: "don't blend." The devices easily have enough ALU and texture bandwidth to run complex algorithms (like full PBR) at high res as long as you're not shading the full screen multiple times. Because the GPUs are tiling, the GPU will eliminate virtually any triangles that are not visible in the final product as long as <b>blending is off</b>.</div>
<div>
<br /></div>
<div>
By comparison, on desktop GPUs we find that <i>utilization</i> is often the biggest problem - that is, the GPU has the physical ALU and bandwidth to over-draw the scene multiple times if blending is left on, but the frame is made up of multiple smaller draw calls, and the GPU can't keep the entire card fed given the higher frequency of small jobs.</div>
<div>
<br /></div>
<div>
(We did try simplifying the shading environment - a smaller number of shaders doing more work by moving what used to be shader changes into conditional logic on the GPU. It was not a win! I think the problem is that if <i>any</i> part of a batch is changed, it's hard for the GPU to keep a lot of work in flight, even if the batch is "less different" than before optimization. Since we couldn't get down to "these batches are identical, merge them" we ended up with similarly poor utilization and higher ALU costs while shading.)</div>
</div>
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-46548234717065040902019-04-06T01:22:00.000-04:002019-04-06T01:22:51.264-04:00Keeping the Blue Side Up: Coordinate Conventions for OpenGL, Metal and Vulkan<div dir="ltr" style="text-align: left;" trbidi="on">
OpenGL, Metal and Vulkan all have different ideas about which way is up - that is, where the origin is located and which way the Y axis goes for a framebuffer. This post explains the API differences and suggests a few ways to cope with them. I'm not going to cover the Z axis or Z-buffer here - perhaps that'll be a separate post.<br />
<br />
<h4 style="text-align: left;">
Things We Can All Agree On</h4>
Let's start with some stuff that's the same for all three APIs: in all three APIs the origin of a framebuffer and the origin of a texture both represent the lowest byte in memory for the data backing that image. In other words, memory addressing starts at 0,0 and then increases as we go to the right, then as we go to the next line. Whether we are in texels in a texture or pixels in a framebuffer, this relationship holds up.<br />
<br />
This means that your model's UV maps and textures will Just Work™ in all three APIs. When your artist puts 0,0 into that giant fighting robot's UV map, the intent is "the texels at the beginning of memory for that texture." You can load the image into the API the same way on all platforms and the UV map will pull out the right texels and the robot will look shiny.<br />
<br />
All three APIs also agree on the definition of a clockwise or counterclockwise polygon - this decision is made in framebuffer coordinates as a human would see it if presented to the screen. This works out well - if your model robot is drawing the way you expect, the windings are the way your artist created them, and you can keep your front face definition consistent across APIs.<br />
<br />
<h4 style="text-align: left;">
Refresher: Coordinate Systems</h4>
For the purpose of our APIs, we care about three coordinate systems:<br />
<br />
<ul style="text-align: left;">
<li>Clip Coordinates: these are the coordinates that come out of your shader. It's often easier to think in terms of normalized device coordinates (NDC) - the post-clip, post-perspective-divide coordinates - but you don't get to see them.</li>
<li>Framebuffer coordinates. These are the coordinates that are rasterized, after the NDC coordinates are transformed by the viewport transform.</li>
<li>Texture coordinates. These are the coordinates we feed into the samplers to read from textures. They're not that interesting because, per above, they work the same on all APIs.</li>
</ul>
<br />
<h4 style="text-align: left;">
OpenGL: Consistently Weird</h4>
OpenGL's conventions are different from approximately every other API ever, but at least they are <i>self-consistent</i>: every single origin in OpenGL is in the lower left corner of the image, so the +Y axis is always up. +Y is up in clip coordinates, NDC, and framebuffer coordinates.<br />
<br />
What's weird about this is that every window manager ever uses +Y = down, so your OpenGL driver is <i>prrrrrobably</i> flipping the image for you when it sends it off to the compositor or whatever your OS has. But after 15+ years of writing OpenGL code, +Y=up now seems normal to me, and the consistency is nice. One rule works everywhere.*<br />
<br />
In the OpenGL world, we render with the +Y axis being up, the high memory of the framebuffer is the top of the image, which is what the user sees, and if we render to texture, the higher texel coordinates are the top of the image, so everything is good. You basically can't mess this system up.<br />
<br />
<h4 style="text-align: left;">
Metal: Up is Down and Down is Up</h4>
Metal's convention is to have +Y = up in clip coordinates (and NDC) but +Y = down in framebuffer coordinates, with the framebuffer origin in the upper left. While this is baffling to programmers coming from GL/GLES, it feels familiar to Direct3d programmers. In Metal, the viewport transformation has a built-in Y flip that you can't control at the API level.<br />
<br />
The window manager presents Metal framebuffers with the lowest byte in the upper left, so if you go in with a model that transforms with +Y = up (OpenGL style), your image will come out right side up and all is good. But be warned, chaos lurks beneath the surface.<br />
<br />
Metal's viewport and scissors are defined in framebuffer coordinates, so they now run +Y=down, and will require parameter adjustment to match OpenGL.<br />
<br />
Note also that our screenshot code that reads back the framebuffer will have to run differently on OpenGL and Metal; one (depending on your output image file format) will require an image flip, and one will not.<br />
<br />
<h4 style="text-align: left;">
Render-to-Texture: Two Wrongs Make a "Ship It"</h4>
Here's the problem with Metal: let's say we draw a nice scene with blue sky up top and green grass on the bottom. We're going to use it as an environment input and sample it. Our OpenGL code expects that low texture coordinates (near 0) get us the green grass at the bottom of memory and high texture coordinates (near 1) get us the blue sky at the top of memory.<br />
<br />
Unfortunately in the render-to-texture case, Metal's upper-left origin has been applied - the sky is now in low memory and the grass is in high memory, and our code that samples this image will show something upside-down and probably quite silly looking.<br />
<br />
<div style="text-align: left;">
We have two options:</div>
<div style="text-align: left;">
</div>
<ol>
<li>Adjust the image at creation time by hacking the transform matrix or</li>
<li>Adjust the code that uses the image by adjusting the sampling coordinates.</li>
</ol>
<br />
For X-Plane, we picked door number 1 - <i>intentionally</i> render the image upside down (by Metal standards, or "the way it was meant to be" by OpenGL standards) so that the image is oriented as the samplers expect.<br />
<br />
Why do it this way? Well, in our case, we often have shaders that sample both from images on disk and from rendered textures; if we flip our textures on disk (to match Metal's default framebuffer orientation) then we have to adjust <i>every</i> UV map that references a disk image, and that's a huge amount of code, because it covers all shaders and C++ code that generate UV maps. Focusing on render-to-texture is a smaller surface area to attack.<br />
<br />
For Metal, we need to intentionally flip the Y coordinate by applying a Y-reverse to our transform stack - in our case this also meant ensuring that <i>every</i> shader used the transform stack; we had a few that were skipping it and had to be set up with identity transforms so the low level code could slip in the inversion.<br />
<br />
We also need to change our front face's winding order, because winding orders are labeled in the API based on what a human would see if the image is presented to the screen. By mirroring our image to be upside down, we've also inverted all of our models' triangle windings, so we need to change our definition of what is in front.<br />
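<br />
Putting those two adjustments together, the per-pass setup looks roughly like the sketch below. The names (pass_info, make_scale, the backend check and the winding enum) are stand-ins for whatever your engine uses, not our actual code:<br />
<pre>// Sketch: on Metal, a pass that renders into a texture which will later be sampled
// with bottom-up (OpenGL-style) coordinates gets a baked-in Y flip, which in turn
// reverses the triangle winding.
void setup_render_to_texture_pass(pass_info& pass)
{
	if (backend == backend_metal && pass.samples_expect_gl_orientation)
	{
		pass.projection = make_scale(1.0f, -1.0f, 1.0f) * pass.projection;	// flip clip-space Y
		pass.front_face = (pass.front_face == winding_ccw) ? winding_cw	// mirrored image ->
		                                                   : winding_ccw;	// swap the front face
	}
}</pre>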
<br />
<h4 style="text-align: left;">
Sampling with the gl_FragCoord or [[position]]: Three Wrongs Make a "How Did I Get Here"?</h4>
There's one more loose end with Metal: if you wrote a shader that uses gl_FragCoord to reconstruct some kind of coordinates based on the window rasterization position, they're going to be upside-down from what your shader did in OpenGL. The upper left of your framebuffer will rasterize with 0,0 for its position, and if you pass this on to a texture sampler, you're going to pick off low memory.<br />
<br />
Had we left well enough alone, this would have been fine, as Metal wants to put the upper left of an image in low memory when rasterizing. But since we intentionally flipped things, we're now...upside down again.<br />
<br />
Here we have two options:<br />
<br />
<ol style="text-align: left;">
<li>Don't actually flip the framebuffer when rendering to texture. Maybe that was a dumb idea.</li>
<li>Insert code to flip the window coordinates.</li>
</ol>
For X-Plane we do both: some render targets are intentionally rasterized at API orientation (and not X-Plane's canonical lower-left-origin orientation) specifically so they can be resampled using window positions. For example, we render a buffer that is sampled to get per-pixel fog, and we leave it at API orientation to get correct fogging.<br />
<br />
Flipping the window coordinate in the sampled code makes sense when the window position is going to be used to reconstitute some kind of world-space coordinate system. Our skydome, for example, is drawn as a full screen quad that calculates the ray that would project through the point in question. It takes as inputs the four corners of the view frustum, and swapping those in C++ fixes our sampling to match our upside down^H^H^H^H^H^HOpenGL-and-perfect-just-the-way-it-is image.<br />
<br />
<h4>
What Have We Learned (About Metal)</h4>
So to summarize, with Metal:<br />
<br />
<ul style="text-align: left;">
<li>If we're going to render to a texture for a model, we put a Y-flip into our transform stack and swap our front face winding direction.</li>
<li>If we're going to render to a texture for sampling via window coordinates, we don't.</li>
<li>If we're going to use window coordinates to reconstruct 3-d, we have to swap the reconstruction coefficients.</li>
</ul>
<br />
<h4 style="text-align: left;">
Vulkan: What Would Spock Do?</h4>
<br />
Apparently: headstands! Vulkan's default coordinate orientation is +Y=down, full stop. The upper left of the framebuffer is the origin, and there's no inversion of the Y axis. This is consistent, but it's also consistently different from <i>every</i> other 3-d API ever, in that the Y axis in clip coordinates is backward from OpenGL, Metal, and DX.<br />
<br />
The good news is: with Vulkan 1.1 you can specify a negative viewport height, which gives you a Y axis swap. With this trick, Vulkan matches DX and Metal, and all you have to worry about is all of the craziness listed above.<br />
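<br />
For reference, the flip is just a matter of handing Vulkan a viewport whose origin is at the bottom of the image with a negative height - a minimal sketch (the command buffer and framebuffer dimensions are assumed to come from your engine):<br />
<pre>#include <vulkan/vulkan.h>

// Requires Vulkan 1.1 (or VK_KHR_maintenance1 on 1.0).
void set_gl_style_viewport(VkCommandBuffer cmd_buf, uint32_t fb_width, uint32_t fb_height)
{
	VkViewport viewport = {};
	viewport.x        = 0.0f;
	viewport.y        = (float) fb_height;    // origin at the bottom...
	viewport.width    = (float) fb_width;
	viewport.height   = -(float) fb_height;   // ...negative height flips the Y axis
	viewport.minDepth = 0.0f;
	viewport.maxDepth = 1.0f;
	vkCmdSetViewport(cmd_buf, 0, 1, &viewport);
}</pre>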
<br />
<br />
* a side effect of this is: when I built our table UI component, I defined the bottom-most row of the table as row 0. My co-workers incorrectly think this is a weird convention for UI, and one of them went as far as to write a table row flipping function called <br />
<pre>correct_row_indexes_because_bens_table_api_was_intentionally_written_backwards_so_we_dont_ask_him_to_write_UI_controls_any_more.</pre>
The moral of the story is that +Y=up is both consistent and a great way to get out of being asked to maintain old and poorly thought out UI widgets. I trust my co-worker will come around to my way of thinking in another fifteen years.</div>
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com2tag:blogger.com,1999:blog-6042417775578107106.post-37530244454565443992018-11-21T00:09:00.000-05:002018-11-21T00:09:46.383-05:00Code like You<div dir="ltr" style="text-align: left;" trbidi="on">
Fantastic lightning talk from this year's <a href="https://cppcon.org/cppcon-2018-program/">cppcon</a>:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<iframe width="320" height="266" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/fbkfH0IZW8g/0.jpg" src="https://www.youtube.com/embed/fbkfH0IZW8g?feature=player_embedded" frameborder="0" allowfullscreen></iframe></div>
<br />
<br />
This was what I was trying to get at in <a href="http://hacksoflife.blogspot.com/2017/03/why-your-c-should-be-simple.html">advocating for</a> not going to the Nth degree with C++ complexity. We have finite cognitive budgets for detail, and the budget isn't super generous, so maybe don't burn it on Boost?</div>
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-25022415124921515932018-08-11T17:11:00.002-04:002018-08-11T17:11:56.762-04:00Solve Less General Problems<div dir="ltr" style="text-align: left;" trbidi="on">
Two decades ago when I first started working at Avid, one of the tasks I was assigned was porting our product (a consumer video editor - think iMovie before it was cool) from the PCI-card-based video capture we first shipped with to digital video via 1394/Firewire.<br />
<br />
Being the fresh-out-of-school programmer that I was, I looked at this and said "what we need is a hardware abstraction layer!" I dutifully designed and wrote a HAL and parameterized more or less everything so that the product could potentially use <i>any</i> video input source we could come up with a plugin for.<br />
<br />
This seemed at the time like really good design, and it did get the job done - we finished DV support.<br />
<br />
After we shipped DV support, the product was canceled, I was moved to a different group, and the HAL was never used again.<br />
<br />
In case it is not obvious from this story:<br />
<br />
<ul style="text-align: left;">
<li>The decision to build a HAL was a totally stupid one. There was no indication in any of the product road maps that we had the legs to do a lot of video formats.</li>
<li>The fully generalized HAL design had a much larger scope than parameterizing only the stuff that actually had to change for DV.</li>
<li>We never were able to leverage any of the theoretical upsides of generalizing the problem.</li>
<li>I'm pretty embarrassed by the entire thing - especially the part where I told my engineering manager about how great this was going to be.</li>
</ul>
I would add to this that had the product not been canned, I'd bet a good bottle of scotch that the next hardware option that would have come along probably would have broken the abstraction (based only on the data points of PCI and DV video) and we would have had to rewrite the HAL anyway.<br />
<br />
There's been plenty written by lots of software developers about not 'future-proofing' a design speculatively. The short version is that it's more valuable to have a smaller design that's easy to refactor than to have a larger design with abstractions that you don't use; the abstractions are a maintenance tax.<br />
<h3 style="text-align: left;">
It's Okay To Be Less General</h3>
One way I view my growth as a programmer over the last two decades is by tracking my becoming okay with being less general. At the time I wrote the HAL, if someone more senior had told me "just go special-case DV", I almost certainly would have explained how this was terrible design, and probably have gone and pouted about it if required to do the fast thing. I certainly wouldn't have appreciated the value to the business of getting the feature done in a fraction of the time.<br />
<br />
In my next model I started learning from the school of hard knocks. I started with a templated data model ("hey, I'm going to reuse this and it'll be glorious") and about part way through recognized that I was being killed by an abstraction tax that wasn't paying me back. (At the time templates tended to crash the compiler, so going fully templated was really expensive.) I made the right decision, after trying all of the other ones first - very American.<br />
<h3 style="text-align: left;">
Being Less General Makes the Problem Solvable</h3>
I wrote <a href="http://hacksoflife.blogspot.com/2016/01/work-stealing-and-lock-free-chaos.html">about this previously</a>, but Fedor Pikus is pretty much saying the same thing - in the very hard problem of lock-free programming, a fully general design might be impossible. Better to do something more specific to your design and have it actually work.<br />
<br />
Here's another way to put this: every solution has strengths and weaknesses. You're better off with a solution where the weaknesses are the part of the solution <i>you don't need</i>.<br />
<h3 style="text-align: left;">
Don't Solve Every Problem</h3>
Turns out <a href="https://www.youtube.com/watch?v=rX0ItVEVjHc">Mike Acton</a> is kind of saying the same thing. The mantra of the Data-Oriented-Design nerds is "know your data". The idea here is to solve the <i>specific</i> problem that your <i>specific </i>data presents. Don't come up with a general solution that works for your data and other data that your program will literally never see. General solutions are more expensive to develop and probably have down-sides you don't need to pay for.<br />
<h3 style="text-align: left;">
Better to Not Leak</h3>
I haven't had a stupid pithy quote on the blog in a while, so here's some parental wisdom: it's better not to leak.<br />
<blockquote class="tr_bq">
Prefer specific solutions that don't leak to general but leaky abstractions.</blockquote>
It can be hard to make a really general non-leaky abstraction. Better to solve a more specific problem and plug the leaks in the areas that really matter.</div>
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com6tag:blogger.com,1999:blog-6042417775578107106.post-39152941263266354902018-08-08T20:50:00.001-04:002018-08-08T20:50:54.995-04:00When Not To Serialize<div dir="ltr" style="text-align: left;" trbidi="on">
Over a decade ago, I wrote a blog post about WorldEditor's <a href="https://hacksoflife.blogspot.com/2007/01/why-not-binary-blobs.html">file format design</a> and, in particular, why I didn't reuse the serialization code from the undo system to write objects out to disk. The TL;DR version is that the undo system is a straight serialization of the in-memory objects, and I didn't want to tie the permanent file format on disk to the in-memory data model.<br />
<br />
That was a good design decision. I have no regrets! The only problem is: the whole premise of the post is quite misleading because:<br />
<br />
While WorldEditor does not use its in memory format as a direct representation of objects, it absolutely does use its in-memory linkage between objects to persist higher level data structures. And this turns out to be just as bad.<br />
<h3 style="text-align: left;">
What Not To Do</h3>
WorldEditor's data model works more or less like this (simplified):<br />
<br />
<ul style="text-align: left;">
<li>A document is made up of ... um ... <i>things.</i></li>
<li>Every thing has an optional parent and zero or more ordered children, referred to by ID.</li>
<li>Higher level structures are made by trees of things.</li>
</ul>
For example, a polygon is a thing that has its contours as children (the first contour is the outer contour and the subsequent ones are holes). Contours, in turn, are things that have vertices as children, defining their path.<br />
<br />
In a WorldEditor document, a taxiway is a cluster of things with various IDs; to rebuild the full geometric information of a taxiway (which is a polygon) you need to use the parent-child relationships and look up the contours and vertices.<br />
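<br />
In rough C++ terms, the schema looks something like this sketch - these are stand-ins for illustration (id_t, thing_type and the property bag are made up), not WorldEditor's actual types:<br />
<pre>struct thing {
	id_t              id;
	id_t              parent;     // optional - invalid/zero for root things
	std::vector<id_t> children;   // ordered; for a polygon: contours, for a contour: vertices
	thing_type        type;       // polygon, contour, vertex, ...
	property_map      properties; // per-type data, e.g. a vertex's location
};</pre>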
<br />
For WorldEditor, the in memory representation is <i>exactly</i> this schema, so the cost of building our polygons in our document is zero. We just build our objects and go home happy.<br />
<br />
This seems like a win! Until...<br />
<h3 style="text-align: left;">
Changing the Data Structures</h3>
As it turns out, polygon-has-contours-has-vertices is a poor choice for an in-memory model of polygons. The big bug is: where are the edges??? In this model, edges are implicit - every pair of vertices defines one.<br />
<br />
Things get ugly when we want to select edges. WorldEditor's selection model is based on an arbitrary set of selected things. But this means that if it's not a thing, it can't be selected. Ergo: we can't select edges. This in turn makes the UI counter-intuitive. We have to go to mind-bending levels of awkwardness to pretend edges are selected when they are not.<br />
<br />
The obvious thing to do would be to just add edges: introduce a new edge object that references its vertices, let the vertices reference adjacent edges, and go home happy.<br />
<br />
This change would be relatively straight forward...until we go to load an old document and all of the edges are missing.<br />
<h3 style="text-align: left;">
The Cost of Serializing Data Structures</h3>
The fail here is that we've serialized our data structures. And this means we have to parse legacy files <i>in terms of those data structures</i> to understand an old file at all. Let's look at all of the fail. To load an old file post-refactor we need to either:<br />
<br />
<ul style="text-align: left;">
<li>Keep separate code around that can rebuild the old file structures into memory in their old form, so that we can then migrate those old in-memory structures into new ones. That's potentially a lot of old code that we probably hate - we wouldn't have rewritten it into a radically different form if we liked it.*</li>
<li>Alternatively, we can create a new data model that can exist with both the layout of the old and new data design. E.g. we can say that edges are optional and then "upgrade" the data model by adding them in when missing. But this sucks because it adds a lot of requirements to an in-memory data model that should probably be focused on performance and correctness.</li>
</ul>
And of course, the old file format you're dealing with was never <i>designed</i> - it's just whatever you had in memory dumped out. That's not going to be a ton of fun to parse in the future.<br />
<h3 style="text-align: left;">
When Not To Serialize</h3>
The moral equivalent of this problem (using the container structures that bind together objects as a file format spec) is dumping your data structures directly into a serializer (e.g. boost::serialization or some other non-brain-damaged serialization system) and calling it a "file format".<br />
<br />
To be clear: serializing your data structures is <i>totally fine</i> as long as file format stability over time is not a design goal. So for example, for undo state in WorldEditor this isn't a problem at all - undoes exist only per app run and don't have to interoperate between any instance of the app (let alone ones with code changes).<br />
<br />
But if you need a file format that you will be able to continue to read after changing your code, serializing your containers is a poor choice, because the only way to read back the old data into the new code will be to create a shadow version of your data model (using those old containers) to get the data back, then migrate in memory.<br />
<h3 style="text-align: left;">
Pay Now or Pay Later</h3>
My view is: writing code to <i>translate</i> your in-memory data structure from its native memory format to a persistent on disk format is a feature, not a bug: that translation code provides the flexibility to allow your in-memory and on disk data layouts change <i>independently</i> - when you need to change one and not the other, you can add logic to the translator. Serialization code (and the more automagic, the more so) binds these things together tightly. This is a problem when the file format and the in-memory format have different design goals, e.g. data longevity vs. performance tuning.<br />
<br />
If you don't write that translation layer for the file format version 1, you'll have to write it for the file format version 2, and the thing you'll be translating won't be designed for longevity and sanity when parsing.<br />
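<br />
To be concrete about what "translate" means: the writer speaks in terms of named keys that are part of the file format spec, not in terms of whatever the structs happen to look like this year. A sketch (keyed_writer, the polygon/contour/vertex types and the key names are made up for illustration):<br />
<pre>// The on-disk vocabulary stays stable even if we later add edges, reorder members,
// or change how contours are stored in memory.
void save_polygon(keyed_writer& out, const polygon& p)
{
	out.begin_object("polygon");
	for (const contour& c : p.contours)
	{
		out.begin_object("contour");
		for (const vertex& v : c.vertices)
			out.write_point("vertex", v.location);
		out.end_object();
	}
	out.end_object();
}</pre>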
<br />
<br />
* We had to do this when migrating X-Plane 9's old binary file format (which was a memcpy of the actual airplane in memory) into X-Plane 10. X-Plane 10 kept around a straight copy of all of the old C structs, read them into place with fread(), and then copied the data out field by field. Since we moved to a key-value pair schema in X-Plane 10, things have been much easier.</div>
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-85146733860705755582018-06-06T09:35:00.000-04:002018-06-06T09:35:00.039-04:00Hats Off for a Fast Turn-Around<div dir="ltr" style="text-align: left;" trbidi="on">
Normally I use this blog to complain about things that are broken, but I want to give credit here to Apple, for their <a href="https://developer.apple.com/videos/play/wwdc2018/604/">WWDC 2018 video turn-around time</a>. The first Metal session ended at 9 pm EDT, and by 9:30 AM the next day it was already available for download. That's an incredible turn-around time for produced video, and it shows real commitment to making WWDC for everyone and not just the attendees - last year we were at ~24 hour turn-around and it would have been easy to say "good enough" and take a pat on the back. My thanks to the team that had to stay up last night making this happen.</div>
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-82572958285340344702018-05-21T10:54:00.000-04:002018-05-21T10:54:28.607-04:00Never Map Again: Persistent Memory, System Memory, and Nothing In Between<div dir="ltr" style="text-align: left;" trbidi="on">
I have, over the years, <a href="http://hacksoflife.blogspot.com/2006/10/vbos-pbos-and-fbos.html">written</a> <a href="http://hacksoflife.blogspot.com/2008/09/so-where-is-that-fast-path.html">more</a> <a href="http://hacksoflife.blogspot.com/2010/02/double-buffering-vbos.html">posts</a> on <a href="http://hacksoflife.blogspot.com/2010/02/double-buffering-part-2-why-agp-might.html">VBOs</a> and <a href="http://hacksoflife.blogspot.com/2010/02/one-more-on-vbos-glbuffersubdata.html">vertex</a> <a href="http://hacksoflife.blogspot.com/2010/08/when-is-your-vbo-double-buffered.html">performance</a> in <a href="http://hacksoflife.blogspot.com/2012/04/beyond-glmapbuffer.html">OpenGL</a> than I'd like to admit. At this point, I can't even find them all. Vertex performance is often critical in X-Plane because we draw a lot of <i>stuff</i> in the world; at altitude you can see a lot of little things, and it's useful to be able to just blast all of the geometry through, if we can find a high performance vertex path.<br />
<br />
It's 2018 and we've been rebuilding our internal engine around abstractions that can work on a modern GL driver, Vulkan and Metal. When it comes to streaming geometry, here's what I have found.<br />
<h3 style="text-align: left;">
Be Persistent!</h3>
First, use persistent memory if you have it. On modern GL 4.x drivers on Windows/Linux, the driver can permanently map buffers for streaming via <a href="https://www.khronos.org/registry/OpenGL/extensions/ARB/ARB_buffer_storage.txt">GL_ARB_buffer_storage</a>. This is plan A! This will be the fastest path you can find because you pay no overhead for streaming geometry - you just write the data. (It's also multi-core friendly because you can grab a region of mapped memory without having to talk to the driver at all, avoiding multi-context hell.)<br />
<br />
That persistent memory is a win is unsurprising - you can't get any faster than not doing any work at all, and persistent memory simply removes the driver from the equation by giving you a direct memory-centric way to talk to the GPU.<br />
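<br />
Here's roughly what plan A looks like. This is a sketch, not our engine code: the buffer size is arbitrary, the GL entry points are assumed to come from your extension loader, and the ring-buffer cursor and fencing that real streaming needs are left out:<br />
<br />
<pre style="font-family: Courier New, Courier, monospace;">
#include &lt;cstring>                               // memcpy; GL comes from your loader

GLuint           stream_vbo  = 0;
char *           stream_base = nullptr;          // CPU-visible for the buffer's lifetime
const GLsizeiptr kStreamBytes = 16 * 1024 * 1024;

void init_stream_buffer()
{
    const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
    glGenBuffers(1, &stream_vbo);
    glBindBuffer(GL_ARRAY_BUFFER, stream_vbo);
    glBufferStorage(GL_ARRAY_BUFFER, kStreamBytes, nullptr, flags);     // immutable storage
    stream_base = (char *) glMapBufferRange(GL_ARRAY_BUFFER, 0, kStreamBytes, flags);  // map once, never unmap
}

// Per draw: copy the vertices in and use the returned offset with glVertexAttribPointer
// and the draw call. No flush, no unmap - the mapping is coherent. Real code also needs
// a fence so the CPU doesn't overwrite data the GPU hasn't consumed yet.
GLsizeiptr stream_write(GLsizeiptr cursor, const void * src, size_t bytes)
{
    memcpy(stream_base + cursor, src, bytes);
    return cursor;
}
</pre>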
<h3 style="text-align: left;">
Don't Be Uncool</h3>
Second, if you don't have persistent memory (e.g. you are on OS X), use system memory via client arrays, rather than <a href="http://hacksoflife.blogspot.com/2015/06/glmapbuffer-no-longer-cool.html">trying to jam your data into a VBO with glMapBuffer</a> or glBufferSubData.<br />
<br />
This second result surprised me, but in every test I've run, client arrays in system memory have out-performed VBOs for small-to-medium sized batches. We were already using system memory for small-batch vertex drawing, but it's even faster for larger buffers.<br />
<br />
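For concreteness, the client-array path looks like this - a sketch with an illustrative interleaved vertex layout, assuming a context where client-side arrays are still legal (which is the situation on OS X if you don't have buffer storage):<br />
<br />
<pre style="font-family: Courier New, Courier, monospace;">
struct vtx { float xyz[3]; float uv[2]; };

void draw_client_arrays(const vtx * verts, const unsigned short * indices, int idx_count)
{
    glBindBuffer(GL_ARRAY_BUFFER, 0);           // attributes source straight from CPU memory
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0);   // the index "pointer" is a real CPU pointer too

    glEnableVertexAttribArray(0);
    glEnableVertexAttribArray(1);
    glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(vtx), verts->xyz);
    glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, sizeof(vtx), verts->uv);

    // The driver consumes the data during this call - the pointers only have to
    // stay valid until glDrawElements returns.
    glDrawElements(GL_TRIANGLES, idx_count, GL_UNSIGNED_SHORT, indices);

    glDisableVertexAttribArray(0);
    glDisableVertexAttribArray(1);
}
</pre>
<br />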
Now before you go and delete all your VBO code, a few caveats:<br />
<br />
<ul style="text-align: left;">
<li>We are mostly testing small-batch draw performance - this is UI, some effects code, but not million-VBO terrain chunks.</li>
<li>The largest streaming data I have tried is a 128K index buffer. That's not tiny - that's perhaps 32 VM pages, but it's not a 2 MB static mesh.</li>
<li>It wouldn't shock me if index buffers are more friendly to system memory streaming than vertex buffers - the 128K index buffer indexes a static VBO.</li>
</ul>
<h3 style="text-align: left;">
Why Would Client Arrays Be Fast?</h3>
I'd speculate that they're easier to optimize.<br />
<br />
Unlike VBOs, in the case of client arrays, the driver knows <i>everything</i> about the data transfer at one time. Everything up until an actual draw call is just stashing pointers for later use - the app is required to make sure the pointers remain valid until the draw call happens.<br />
<br />
When the draw call happens, the driver knows:<br />
<br />
<ul style="text-align: left;">
<li>How big the data is.</li>
<li>What format the data is in.</li>
<li>Which part of the data is actually consumed by the shader.</li>
<li>Where the data is located (system memory, duh).</li>
<li>That this is a streaming case - since the API provides no mechanism for efficient reuse, the driver might as well assume no reuse.</li>
</ul>
<div>
There's not really any validation to be done - if your client pointers point to junk memory, the driver can just segfault.</div>
<div>
<br /></div>
<div>
Because the driver knows how big the draw call is at the time it manages the vertex data, it can select the optimal vertex transfer mode for the particular hardware and draw call size. Large draws can be scheduled via a DMA (worth it if enough data is being transferred), medium draws can be sourced right from AGP memory, and tiny draws could even be stored directly in the command buffer.</div>
<h3 style="text-align: left;">
You Are Out of Order</h3>
<div>
There's one last thing we know for client arrays that we don't know for map/unmap, and I think this might be the most important one of all: in the case of client arrays, vertex transfer is strictly FIFO - within a single context (and client arrays data is not shared) submission order from the client is draw/retirement order.</div>
<div>
<br /></div>
<div>
That means the driver can use a simple ring buffer to allocate memory for these draw calls. That's really cheap unless the total size of the ring buffer has to grow.</div>
<div>
<br /></div>
<div>
By comparison, the driver can assume nothing about orphaning and renaming of VBOs. Rename/map/unmap/draw sequences show up as ad hoc calls to the driver, so the driver has to allocate new backing storage for VBOs out of a free store/heap. Even if the driver has a magazine front-end, the cost of heap allocations in the driver is going to be more expensive than bumping ring buffer pointers.</div>
<h3 style="text-align: left;">
What Can We Do With This Knowledge?</h3>
<div>
Once we recognize that we're going to draw only with client arrays and persistent memory (and not with non-persistent mapped and unmapped VBOs), we can recognize a simplifying assumption: our unmap/flushing overhead is <i>zero</i> in every case, and we can simplify client code around this.</div>
<div>
<br /></div>
<div>
In a previous post, I suggested two ways around the <a href="http://hacksoflife.blogspot.com/2018/01/flush-less-often.html">cost of ensuring that your data is GPU-visible</a>: persistent memory and deferring all command encoding until later.</div>
<div>
<br /></div>
<div>
If we're not going to have to unmap, we can just go with option 1 all of the time. If we don't have persistent coherent memory, we treat system memory as our persistent coherent memory and draw with client arrays. This means we can drop the cost of buffering up and replaying our command encoding and just render directly.</div>
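<div>
<br /></div>
<div>
In code, the simplification might look something like this - all of the names here (stream_get, has_persistent_mapping, etc.) are made up for illustration, and the ring-buffer fencing is again omitted:</div>
<div>
<br /></div>
<pre style="font-family: Courier New, Courier, monospace;">
#include &lt;cstdlib>                        // malloc for the fallback; a real engine would pool this

bool    has_persistent_mapping = false;   // detected at init
GLuint  persistent_vbo  = 0;              // from glBufferStorage + glMapBufferRange
char *  persistent_base = nullptr;
size_t  ring_cursor     = 0;

struct stream_alloc {
    void *  ptr;        // where the app writes vertex data
    GLuint  vbo;        // 0 means "draw via client arrays from system memory"
    size_t  offset;     // offset within vbo when vbo != 0
};

// One entry point for "give me somewhere to write this draw call's vertices".
// Either way there is no unmap or flush between writing the data and drawing.
stream_alloc stream_get(size_t bytes)
{
    stream_alloc a;
    if (has_persistent_mapping)
    {
        a.vbo    = persistent_vbo;
        a.offset = ring_cursor;
        a.ptr    = persistent_base + a.offset;
        ring_cursor += bytes;
    }
    else
    {
        a.vbo    = 0;
        a.offset = 0;
        a.ptr    = malloc(bytes);         // freed at end-of-frame in real code
    }
    return a;
}
</pre>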
</div>
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-2187862126966278952018-03-27T20:53:00.001-04:002018-03-27T20:53:48.841-04:00There Must Be Fifty Ways to Fail Your Stencil<div dir="ltr" style="text-align: left;" trbidi="on">
When we first put deferred rendering into X-Plane 10.0, complete with lots of spot lights in-scene, I coded up stencil volumes for the lights in an attempt to save some shading power. The basic algorithm is:<br />
<br />
<ul style="text-align: left;">
<li>Do a stencil pre-pass on all light volumes where:</li>
<ul>
<li>The back of the volume failing increments. This happens when geometry is in front of the back of the light volume - this geometry <i>might</i> be lit!</li>
<li>The front of the volume failing decrements. This happens when geometry is in front of the front light volume and thus occludes (in screen space) anything that could have been lit.</li>
</ul>
<li>Do a second pass with stencil testing for > 0. Only pixels with a positive count had geometry in between the two halves of the bounding volume, and thus are light candidates.</li>
</ul>
This technique eliminates both fragments of occluded lights and fragments where the light shines through the air and hits nothing.<br />
<br />
Typically the stencil modes are inc/dec with wrapping so that we aren't dependent on our volume fragments going out in any particular order - it all nets out.<br />
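<br />
In GL terms, the two passes look roughly like this (a sketch - draw_light_volumes() stands in for however your engine submits the volume meshes, and details like which face you rasterize and how you depth-test in the shading pass vary by engine):<br />
<br />
<pre style="font-family: Courier New, Courier, monospace;">
// Pass 1: stencil-only pre-pass over the light volumes. Depth-test against the
// scene, but write neither color nor depth, and rasterize both faces.
glEnable(GL_DEPTH_TEST);
glDepthMask(GL_FALSE);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDisable(GL_CULL_FACE);
glEnable(GL_STENCIL_TEST);
glStencilFunc(GL_ALWAYS, 0, 0xFF);
// Back faces that fail the depth test: scene geometry is in front of the back of
// the volume and might be lit - increment (with wrap, so order doesn't matter).
glStencilOpSeparate(GL_BACK,  GL_KEEP, GL_INCR_WRAP, GL_KEEP);
// Front faces that fail the depth test: the scene occludes the volume - decrement.
glStencilOpSeparate(GL_FRONT, GL_KEEP, GL_DECR_WRAP, GL_KEEP);
draw_light_volumes();

// Pass 2: shade only where the count ended up positive.
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glStencilFunc(GL_LESS, 0, 0xFF);      // passes where 0 &lt; stencil value
glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);
draw_light_volumes();                 // with the light shader bound this time
</pre>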
<br />
We ended up <i>not</i> shipping this for 10.0 because it turned out the cure was worse than the disease - hitting the light geometry a second time hurt fps more than the fill savings for a product that was already outputting just silly amounts of geometry.<br />
<br />
I made a note at the time that we could partition our lights and only stencil the ones in the first 200m from the camera - this would get all the fill heavy lights without drawing a ton of geometry.<br />
<br />
I came back to this technique the other day, but something had changed: we'd ended up using a pile of our stencil bits for various random purposes, leaving very little behind for stencil volumes. We were down to 3 bits for our counter, and this was the result.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4f64wxoePfXBNdB8RaZrqPH9DPhiJ_IXK5b__VnNDKrgHVPyVF3eCwE5F_EPItZdKNar1FmMQLKEIuyhQoa3FyfEAS0faD46NMZHqTwVpzIHxGr_r0IDar1XAOu3SgjjS1zGpfZVsICOu/s1600/simple+wrap-around.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="720" data-original-width="1280" height="360" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh4f64wxoePfXBNdB8RaZrqPH9DPhiJ_IXK5b__VnNDKrgHVPyVF3eCwE5F_EPItZdKNar1FmMQLKEIuyhQoa3FyfEAS0faD46NMZHqTwVpzIHxGr_r0IDar1XAOu3SgjjS1zGpfZVsICOu/s640/simple+wrap-around.png" width="640" /></a></div>
<br />
That big black void in between the lights in the center of the screen is where the number of overlapping non-occluded lights hitting a light-able surface hit <i>exactly</i> the wrap-around point in our stencil buffer - we got eight increments, wrapped to zero and the lights were stencil-tested out. The obvious way to cope with this is to use more than 3 stencil bits. :-)<br />
<br />
I looked at whether there was something we could do in a single pass. Our default mode is to light with the back of our light volume, untested; the far clip plane is, well, far away, so we get good screen coverage.<br />
<br />
I tried lighting with the front of the light volume, depth tested, so that cases where the light was occluded by intervening geometry would optimize out. I used GL_ARB_depth_clamp to ensure that the front of my light volume would be drawn even if it hit the front clip plane.<br />
<br />
It did not work! The problem is: since our view is a frustum, the side planes of the view volume cross at the camera location; thus if we are inside our light volume, the part behind us will be culled out despite depth clamp. This wasn't a problem for stencil volumes because they do the actual drawing off the back of the volume, and the front is just for optimization.</div>
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-6747447344011849622018-01-29T21:38:00.000-05:002018-01-29T21:38:30.045-05:00Flush Less Often<div dir="ltr" style="text-align: left;" trbidi="on">
Here's a riddle:<br />
<blockquote class="tr_bq">
Q: What do a two-year-old and OpenGL have in common? </blockquote>
<blockquote class="tr_bq">
A: You never know when either of them is going to flush.*</blockquote>
In the case of a two-year-old, you can wait a few years and he'll find different ways to not listen to you; unfortunately, in the case of OpenGL, this is a problem of API design, so we have to use a fairly big hammer to fix it.<br />
<h4 style="text-align: left;">
What IS the Flushing Problem?</h4>
Modern GPUs work by writing a command buffer (a list of binary encoded drawing instructions) to memory (using the CPU) and then "sending" that buffer to the GPU for execution, either by changing ownership of the shared memory or by a DMA copy.<br />
<br />
Until that buffer of commands goes to the GPU, from the GPU's perspective, you haven't actually asked it to do anything - your command buffer is just sitting there, collecting dust, while the GPU is idle.<br />
<br />
In modern APIs like Vulkan, Metal, and DX12, the command buffer is an object you build, and then you explicitly send it to the GPU with an API call.<br />
<br />
With OpenGL, the command buffer is implicit - you never see it, it just gets generated as you make API calls. The command buffer is sent to the GPU ("flushed") under a few circumstances:<br />
<ol style="text-align: left;">
<li>If you ask GL to do so via glFlush.</li>
<li>If you make a call that does an automatic flush (glFinish, a buffer swap, waiting on a sync with the flush bit set).</li>
<li>If the command buffer fills up due to you doing a lot of stuff.</li>
</ol>
This last case is the problematic one because it's completely unpredictable.<br />
<h4 style="text-align: left;">
Why Do We Care?</h4>
Back in the day, we didn't care - you'd write commands and buffers would go out when they were full (ensuring a "goodly amount of work" gets sent to the GPU) and the last command buffer was sent when you swapped your back buffer.<br />
<br />
But with modern OpenGL, calling the API is only a small fraction of the work we do; most of the work of drawing involves filling buffers with numbers. This is where your meshes and hopefully constant state are all coming from.<br />
<br />
The flushing problem comes to haunt us when we want to draw a large number of small drawing batches. It's easy to end up with code like this:<br />
<br />
<pre style="font-family: Courier New, Courier, monospace;">
// write some data to memory
glDrawElements(...)
// write some data to memory
glDrawElements(...)
</pre>
<div>
<br /></div>
<div>
Expanding this out, the code actually looks more like:</div>
<div>
<br /></div>
<pre style="font-family: Courier New, Courier, monospace;">
// map a buffer
// write to the buffer
// flush and unmap the buffer
glDrawElements(...)

// map a buffer
// write to the buffer
// flush and unmap the buffer
glDrawElements(...)
</pre>
<div>
<br /></div>
<div>
The problem is: even with glMapBufferRange and "unsynchronized" buffers, you still have to issue some kind of flush to <i>your</i> data before each drawing call.</div>
<div>
<br /></div>
<div>
The reason this is necessary is: glDrawElements might cause your command buffer to be sent to the GPU at any time! Therefore you have to have your data buffer completely flushed and ready to go after every drawing call.<br />
<h4>
How Do We Fix It?</h4>
You basically have two choices to make code like the above fast:<br />
<br />
<ol style="text-align: left;">
<li>If you are on a modern GL, use persistent coherent buffers. They don't <i>need</i> to be flushed - you can write data, call draw, and if the GL happens to send the command buffer down, your data is already visible. This is a great solution for UBOs on Windows.</li>
<li>If you can't get persistent coherent buffers, <a href="http://hacksoflife.blogspot.com/2015/03/accumulation-to-improve-small-batch.html">defer all of your actual state and draw calls</a> until every buffer has been built.</li>
</ol>
<br />
This second technique is a double-edged sword.<br />
<br />
<ul style="text-align: left;">
<li>Win: it works everywhere, even on the oldest OpenGL.</li>
<li>Win: as long as you're accumulating your state changes, you can optimize out stupid stuff - handy when client code tends to produce crap OpenGL call-streams.</li>
<li>Lose: it does require you to marshal the entire API, so it's only good for code that sits on a fairly narrow footprint.</li>
</ul>
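Here's an illustrative sketch of the second approach - a thin "record now, submit later" layer. The names are hypothetical; the point is just that no GL call happens until every buffer has been fully written, so it no longer matters when the driver kicks the command buffer:<br />
<br />
<pre style="font-family: Courier New, Courier, monospace;">
#include &lt;vector>

struct deferred_draw {
    GLuint  vao;
    GLuint  program;
    GLint   first;
    GLsizei count;
};

static std::vector&lt;deferred_draw> s_queue;

void queue_draw(GLuint vao, GLuint program, GLint first, GLsizei count)
{
    s_queue.push_back({ vao, program, first, count });     // no GL calls here
}

void flush_deferred()
{
    // Every vertex/uniform buffer is finished by the time we get here, so a
    // mid-stream flush of the command buffer can no longer catch us half-written.
    for (const deferred_draw & d : s_queue)
    {
        glUseProgram(d.program);
        glBindVertexArray(d.vao);
        glDrawArrays(GL_TRIANGLES, d.first, d.count);
    }
    s_queue.clear();
}
</pre>
<br />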
For X-Plane, we actually intentionally choose <i>not</i> to use UBOs when persistent-coherent buffers are not also available. It turns out the cost of flushing per draw call is really bad, and our fallback path (loose uniforms) is actually surprisingly fast, because the driver guys have tuned the bejeezus out of that code path.<br />
<br /></div>
<br />
* My two-year-old has figured out how to flush the toilet and thinks it's fascinating. What he hasn't figured out how to do is listen^H^H^H^H^Hwait until I'm done peeing. (And yes, non-parents, of course <a href="https://www.youtube.com/watch?v=dtWWKM-5tLE">peeing is a group activity</a>. Duh.) The monologue went something like:<br />
<br />
"Okay Ezra, wait until Daddy's done. No, not yet. It's too soon. Don't flush. Ezra?! Sigh. Wait, this is exactly like @#$@#$ glDrawElements!"</div>
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-62338733492719375022018-01-13T11:18:00.002-05:002018-01-13T11:18:31.347-05:00Fixing Camera Shake on Single Precision GPUs<div dir="ltr" style="text-align: left;" trbidi="on">
I've tried to write this post twice now, and I keep getting bogged down in the background information. In X-Plane 11.10 we fixed our long-standing problem of camera shake, caused by 32-bit floating point transforms in a very large world.<br />
<br />
I did a literature search on this a few months ago and didn't find anything that met our requirements, namely:<br />
<br />
<ul style="text-align: left;">
<li>Support GPUs without 64-bit floating point (e.g. mobile GPUs).</li>
<li>Keep our large (100 km x 100 km) mesh chunks.</li>
</ul>
<br />
I didn't find anything that met both of those requirements (the 32-bit-friendly solutions I found required major changes to how the engine deals with mesh chunks), so I want to write up what we did.<br />
<h3 style="text-align: left;">
Background: Why We Jitter</h3>
<div>
X-Plane's world is large - scenery tiles are about 100 km x 100 km, so you can be up to 50 km from the origin before we "scroll" (e.g. change the relationship between the Earth and the primary rendering coordinate system so the user's aircraft is closer to the origin). At these distances, we have about 1 cm of precision in our 32-bit coordinates, so any time we are close enough to the ground that 1 cm is larger than 1 pixel, meshes will "jump" by more than 1 pixel during camera movement due to rounding in the floating point transform stack.</div>
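<div>
<br /></div>
<div>
Back of the envelope: a 32-bit float carries a 24-bit significand, so for coordinates between 32,768 m and 65,536 m the gap between adjacent representable values is 2^16 / 2^24 m, or about 4 mm. A few roundings at that scale through the transform chain and you're at the centimeter of slop described above.</div>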
<div>
<br /></div>
<div>
It's not hard to have 1 pixel be larger than 1 cm. If you are looking at the ground on a monitor 1920 pixels across, you might have those 1920 pixels covering 2 meters of ground, for about 1 mm per pixel. The ground is going to jitter like hell.</div>
<div>
<br /></div>
<div>
Engines that don't have huge offsets don't have these problems - if we were within 1 km of the origin, we'd have almost 100x more precision and the jitter might not be noticeable. Engines can solve this by having small worlds, or by scrolling the origin a lot more often.</div>
<div>
<br /></div>
<div>
Note that it's <b>not</b> good enough to just keep the main OpenGL origin near the user. If we have a large mesh (e.g. a mesh whose vertices get up into the 50 km magnitude) we're going to jitter, because <i>at the time that we draw them</i> our effective transform matrix is going to need an offset to bring the 50 km offset back to the camera. (In other words, even if our main transform matrix doesn't have huge offsets that cause us to lose precision, we'll have to do a big translation to draw our big object.)</div>
<h3 style="text-align: left;">
Fixing Instances With Large Offsets</h3>
<div>
The first thing we do is make our transform stack double precision on the CPU (but <i>not</i> the GPU). To be clear, we need double precision:</div>
<div>
<ul style="text-align: left;">
<li>In the internal transform matrices we keep on the CPU as we "accumulate" rotates, translates, etc.</li>
<li>In the calculations where we modify this matrix (e.g. if we are going to transform, we have to up-res the incoming matrix, do the calculation in double, and save the results in double).</li>
<li>We do <b>not</b> have to send the final transforms to the GPU in double - we can truncate the final model-view, etc.</li>
<li>We can accept input transforms from client code in single or double precision.</li>
</ul>
This will fix all jitter caused by objects with small-offset meshes that are positioned far from the origin. E.g. if our code goes push, translate (large offset), rotate (pose), draw, pop, then this fix alone gets rid of jitter on that model, and it doesn't require any changes to the engine or shader.</div>
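<div>
<br /></div>
<div>
As a sketch (hypothetical matrix types, not our actual classes), the pattern is just "accumulate in double, truncate only at upload time":</div>
<div>
<br /></div>
<pre style="font-family: Courier New, Courier, monospace;">
// Column-major 4x4 matrices; all accumulation stays in double on the CPU.
struct mat4d { double m[16]; };
struct mat4f { float  m[16]; };

mat4d multiply(const mat4d & a, const mat4d & b)
{
    mat4d r;
    for (int col = 0; col &lt; 4; ++col)
        for (int row = 0; row &lt; 4; ++row)
        {
            double sum = 0.0;
            for (int k = 0; k &lt; 4; ++k)
                sum += a.m[k * 4 + row] * b.m[col * 4 + k];
            r.m[col * 4 + row] = sum;
        }
    return r;
}

// Truncate to float only when handing the finished matrix to the GPU, e.g.
// glUniformMatrix4fv(loc, 1, GL_FALSE, to_gpu(model_view).m);
mat4f to_gpu(const mat4d & src)
{
    mat4f dst;
    for (int i = 0; i &lt; 16; ++i) dst.m[i] = (float) src.m[i];
    return dst;
}
</pre>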
<div>
<br /></div>
<div>
We do eat the cost of double precision in our CPU-side transforms - I don't have numbers yet for how much of a penalty on old mobile phones this is, but on desktop this is not a problem. (If you are beating the transform stack so badly that this matters, it's time to use hardware instancing.)</div>
<div>
<br /></div>
<div>
This hasn't fixed most of our jitter - large meshes and hardware instances are still jittering like crazy, but this is a necessary pre-requisite.</div>
<h3 style="text-align: left;">
Fixing Large Meshes</h3>
<div>
The trick to fixing jitter on meshes with large vertex coordinates is understanding <i>why</i> we have precision problems. The fundamental problem is this: transform matrices apply rotations first and translations second. Therefore in any model-view matrix that positions the world, the translations in the matrix have been <i>mutated</i> by the rotation basis vectors. (That's why your camera location is not just items 12, 13, and 14 of your MV matrix.)</div>
<div>
<br /></div>
<div>
If the camera's location in the world is a very big number (necessary to get you "near" those huge-coordinate vertices so you can see them) then the precision at which they are transformed by the basis vectors is...not very good.</div>
<div>
<br /></div>
<div>
That's not actually the total problem. (If it was, preparing the camera transform matrix in double on the CPU would have gotten us out of jail.)</div>
<div>
<br /></div>
<div>
The problem is that we are counting on these calculations to <i>cancel each other out</i>:</div>
<div>
<br /></div>
<div>
vertex location * camera rotation + (camera rotation * camera location) = eye space vertex</div>
<div>
<br /></div>
<div>
The camera-rotated location was calculated on the CPU ahead of time and baked into the translation component of your MV matrix, but the vertex location is huge and is rotated by the camera rotation on the GPU in 32 bits. So we have two huge offsets multiplied by very finicky rotations - we add them together and hope that the result is pixel accurate, so that tiny movements of the camera are smooth.</div>
<div>
<br /></div>
<div>
They are not - it's the rounding error of the cancelation of these calculations that <i>is</i> our jitter.</div>
<div>
<br /></div>
<div>
The solution is to <b>change the order of operations of our transform</b> stack. We need to introduce a <i>second</i> translation step that (unlike a normal 4x4 matrix operation), happens <i>before</i> rotation, in world coordinates and not camera coordinates. In other words, we want to do this:</div>
<div>
<br /></div>
<div>
(vertex location - offset) * camera rotation + (camera rotation * (camera location - offset)) = ...</div>
<div>
<br /></div>
<div>
Here's why this can actually work: "offset" is going to be a number that brings our mesh <i>roughly</i> near the camera. Since it doesn't have to bring us exactly to the camera, it can change <i>infrequently</i> and has very few low-order bits to lose to rounding. Since our vertex location and offset are not changing, this number is going to be stable across frames.</div>
<div>
<br /></div>
<div>
Our camera location minus this offset can be done on the CPU side in double precision, so the results of this will be both small (in magnitude) and high precision.</div>
<div>
<br /></div>
<div>
So now we have two small locations multiplied by the camera rotation that have to cancel out - this is what we would have had if our engine used only small meshes.</div>
<div>
<br /></div>
<div>
In other words, by applying a rounded, infrequently changing static offset first, we can reduce the problem to what we would have had in a small-world engine, "just in time".</div>
<div>
<br /></div>
<div>
You might wonder what happens if the mesh vertex is nowhere near our offset - then my claim that the result will be really small is wrong. But that's okay - since the offset is near the camera, mesh vertices that don't cancel well are far from the camera and too small/far away to jitter. Jitter is a problem for close stuff.</div>
<div>
<br /></div>
<div>
The CPU-side math goes like this: given an affine model-view matrix in the form of R, T (where R is the 3x3 rotation and T is the translation vector), we do this:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">// Calculate C, the camera's position, by reverse-</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">// rotating the translation</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">C = transpose(R) * T</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">// Grid-snap the camera position in world coordinates - I used </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">// a 4 km grid. Smaller grids mean more frequent jumps but </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">// better precision.</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">C_snap = grid_round(C)</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">// Offset the matrix's translation by this snap (moved back </span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">// to post-rotation coordinates), to compensate for the pre-offset.</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">T -= R * C_snap</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">// Pre-offset is the opposite of the snap.</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">O = -C_snap</span></div>
<div>
<br /></div>
<div>
In our shader code, we transform like this:</div>
<div>
<br /></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">v_eye = (v_world - O) * modelview_matrix</span></div>
<div>
<br /></div>
<div>
There's no reason why the sign has to be this way - O could have been C_snap and we could have added in the shader; I found it was easier to debug having the offset be actual locations in the world.</div>
<div>
<h3>
Fixing Hardware Instancing</h3>
</div>
<div>
There's one more case to fix. If your engine has hardware instancing, you may have code that takes the (small) model mesh vertices and applies an instancing transform first, then the main transform. In this case, the large vertex is the result of the instancing matrix, not the mesh itself.</div>
<div>
<br /></div>
<div>
This case is easily solved - we simply subtract our camera offset from the translation part of the hardware instance. This ensures that the instance, when transformed into our world, will be near the camera - no jitter.</div>
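<div>
<br /></div>
<div>
A sketch of that fix - the instance layout is hypothetical (position kept in double on the CPU, a column-major float 4x4 for the GPU with the translation in elements 12..14), and the rotation part of the matrix is assumed to be filled in elsewhere:</div>
<div>
<br /></div>
<pre style="font-family: Courier New, Courier, monospace;">
struct instance {
    double world_x, world_y, world_z;   // absolute position, kept in full precision
    float  gpu_xform[16];               // what the instancing shader actually sees
};

// Subtract the same snapped camera offset used to build the model-view. The
// difference is small, so truncating it to float no longer loses meaningful bits,
// and the instance lands near the origin before any 32-bit GPU math happens.
void rebase_instance(instance & inst, double snap_x, double snap_y, double snap_z)
{
    inst.gpu_xform[12] = (float) (inst.world_x - snap_x);
    inst.gpu_xform[13] = (float) (inst.world_y - snap_y);
    inst.gpu_xform[14] = (float) (inst.world_z - snap_z);
}
</pre>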
<div>
<br /></div>
<div>
One last note: I found some drivers were very finicky about the order of operations - if the calculation is not written with the offset applied before the transform, the de-jitter totally fails. The precise and invariant qualifiers didn't seem to help; only getting the code "just right" did.</div>
</div>
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0tag:blogger.com,1999:blog-6042417775578107106.post-56990543571651707902017-06-07T23:51:00.001-04:002017-06-07T23:51:58.779-04:00How to Reset Steam VR When It Can't Talk to the Rift<div dir="ltr" style="text-align: left;" trbidi="on">
Periodically in the course of writing an OpenVR app, I find that SteamVR can't talk to my HMD. One of the 500 processes that collaborate to make VR work has kicked the bucket. Here's the formula to fix it.<br />
<br />
First, kill the process tree based on OVRServer_x64. All the Oculus stuff should die and then immediately respawn. Minimize their portal thingie.<br />
<br />
Kill every vrXXX process (vrserver, vrmonitor, vrcompositor, vrdashboard). SteamVR should not look like it's running and will not auto-relaunch.<br />
<br />
Now you're good - relaunch your game and SteamVR should restart and be able to communicate with the headset.</div>
Benjamin Supnikhttp://www.blogger.com/profile/04886313844644521178noreply@blogger.com0