Saturday, June 11, 2022

sRGB, Pre-Multiplied Alpha, and Compression

 How do sRGB color-space textures (especially compressed ones) interact with blending? First, let's define some terms:

An image is "sRGB encoded" if the linear colors have been remapped using the sRGB encoding. This encoding is meant for 8-bit storage of the textures and normalized colors from 0.0 to 1.0. The "win" of sRGB encoding is that it uses more of our precious 8 bits on the dark side of the 0..1 distribution, and humans have weird non-linear vision where we see more detail in that range.

Thus sRGB "spends our bits where they are needed" and limits banding artifacts.  If we just viewed an 8-bit color ramp from 0 to 1 linearly, we'd see stripes in the darks and perceive a smooth color wash in the hilights.

Alpha values are always considered linear, e.g. 50% is 50% and is stored as 127 (or maybe 128, consult your IHV?) in an 8-bit texture. But how that alpha is applied depends on the color space where blending is performed.

  • If we decode all of our sRGB textures to linear values (stored in more than 8 bits) and then blend in this linear space, we have "linear blending" (sometimes called "gamma-correct" blending - the terminology is confusing no matter how you slice it).  With linear blending, translucent surfaces blend the way light blends - a mix of red and green will make a nice bright yellow. Most game engines now work this way, it's pretty much mandatory for HDR (because we're not in a 0..1 range, we can't be in sRGB) and it makes lighting effects look good.
  • If we stay in the 8-bit sRGB color space and just blend there, we get "sRGB blending". Between red and green we'll see a sort of dark mustard color and perceive a loss of light energy.  Lighting effects blended in sRGB look terrible, but sometimes artists want sRGB blending. The two reasons I've heard from my art team are (1) "that's what Photoshop does" and (2) they are trying to simulate partial coverage (e.g. this surface has rusted paint and so half of the visible pixel area is not showing the paint) and blending in a perceptual space makes the coverage feel right. (The sketch below works the numbers for both kinds of blends.)
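
Here's a quick numeric sketch of the two blends, mixing pure red and pure green 50/50. The constants are the standard sRGB transfer function; the helper names are just made up for illustration.

#include <math.h>

float srgb_decode(float c)     // encoded [0,1] -> linear
{
    return (c <= 0.04045f) ? c / 12.92f : powf((c + 0.055f) / 1.055f, 2.4f);
}

float srgb_encode(float l)     // linear [0,1] -> encoded
{
    return (l <= 0.0031308f) ? l * 12.92f : 1.055f * powf(l, 1.0f / 2.4f) - 0.055f;
}

// Linear blending: decode, blend, re-encode.
//   0.5 * red + 0.5 * green = (0.5, 0.5, 0) in linear light, which encodes to
//   roughly (0.74, 0.74, 0) - a bright yellow.
//
// sRGB blending: blend the encoded bytes directly.
//   0.5 * (255,0,0) + 0.5 * (0,255,0) = (127,127,0) encoded, which is only about
//   (0.21, 0.21, 0) in linear light - the dark mustard.
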
A texture has pre-multiplied alpha if the colors in the texture have already been multiplied (in fractional space) by the alpha channel. In a texture like this, a 50% translucent area has RGB colors that are 50% darker. There are two wins to this:
  • You can save computing power - not exciting for today's GPUs, but back when NeXT first demonstrated workstations with real-time alpha compositing on the CPU (in integer, of course), premultiplication was critical for cutting ALU by removing half the multiplies from the blending equation.
  • Premultiplied alpha can be filtered (e.g. two samples can be blended together) without any artifacts.  The black under clear pixels (because they are multiplied with 0) is the correct thing to blend into a neighboring opaque pixel to make it "more transparent" - the math Just Works™. With non-premultiplied textures, the color behind clear pixels appears in intermediate samples - tool chains must make sure to "stuff" those clear texels with nearby colors. (A quick worked example is below.)
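
A quick worked example of the filtering case - a hedged sketch, with the RGBA struct and values purely illustrative:

struct RGBA { float r, g, b, a; };

// A 50/50 bilinear tap between two texels.
RGBA filter(RGBA x, RGBA y) {
    return { (x.r + y.r) * 0.5f, (x.g + y.g) * 0.5f,
             (x.b + y.b) * 0.5f, (x.a + y.a) * 0.5f };
}

// Premultiplied: filter({1,0,0,1}, {0,0,0,0}) = {0.5, 0, 0, 0.5}. Compositing
// with "src + (1 - a) * dst" gives exactly half-strength red over the background.
// Non-premultiplied: the clear texel still stores *some* RGB (whatever the tool
// left there), and that color shows up in the filtered sample - hence the stuffing.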

Pre-Multiplication and sRGB

Can we pre-multiply sRGB encoded textures that are intended to be decoded and used in a linear renderer? Yes! But, here's the catch: we can only do this right if we know the type of blending we will do.

Pre-multiplying is basically doing a little bit of the blending work ahead of time. Therefore it's actually not surprising that we need to know the exact math of the blend to do this work in advance. There are two cases:

If we are going to blend in sRGB (e.g. nothing is ever decoded) we can just pre-multiply our 8-bit colors with 8-bit alpha and go home happy.  This is what we did in 2005 and we liked it, because it's all there was.

If we are going to decode our sRGB to linear space, then blend, we have to do something more complex: when pre-multiplying our texture we need to:
  1. Decode our colors to linear (and use more than eight bits to do the following intermediate calculations).
  2. Multiply the alpha value by the linear color.
  3. Re-encode the resulting darkened color back to sRGB.
We have now baked in the multiply that would have happened in normal linear blending. This should be okay in terms of precision - while we are likely to have darker results post-blending, sRGB is already spending bits on those darker values, and they're going to have less weight on screen. A small sketch of the bake follows.
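
Here's a minimal sketch of that bake for one 8-bit channel, reusing the srgb_decode/srgb_encode helpers from the earlier sketch - every name here is illustrative, not a real API:

#include <stdint.h>

float srgb_decode(float encoded);   // as in the earlier sketch
float srgb_encode(float linear);

// Pre-multiply one sRGB-encoded color channel for *linear* blending:
// decode to linear, multiply by alpha in linear space, then re-encode.
uint8_t premultiply_for_linear_blend(uint8_t encoded, uint8_t alpha)
{
    float linear   = srgb_decode(encoded / 255.0f);    // step 1: decode
    float darkened = linear * (alpha / 255.0f);        // step 2: multiply by alpha
    float baked    = srgb_encode(darkened);            // step 3: re-encode
    return (uint8_t)(baked * 255.0f + 0.5f);
}

// For sRGB blending the bake is just the 8-bit multiply - roughly
// (encoded * alpha + 127) / 255 - with no decode/encode at all.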

The fine print here is that the "bake" we did to convert from non-premultiplied to pre-multiplied has different math based on the blending color space and that decision is permanent. While a non-premultiplied texture can be used for either blending, once you pre-multiply, you're committed.

Does Compression Change Anything?

What if our sRGB textures are going to be compressed? In theory, no - the S3TC spec says the block is decompressed into sRGB color space first, then "other stuff happens". And this is useful - with only 16 bits per color end-point, S3TC blocks need to spread their bits as evenly as possible from a perceptual perspective.

In practice, I'd be very wary. DXT3/5 provide two separate compression blocks, one for color and one for alpha. But with premultiplied alpha, the alpha mask has been "baked" into the color channel and is going to put pressure on end-point selection.

Consider the case of a 4x4 tile that goes from clear to opaque as we go left to right, and green to blue as we go from bottom to top.

Without pre-multiplication, we pick green and blue as color end points, clear and opaque as alpha end points, and we can get very accurate compression.

With pre-multiplication we're screwed. We either pick black as an end point (so we can have black in the color under the clear pixels), leaving our other color end point as a single green/blue mash-up that destroys the color change, or we pick green and blue as our colors and get "light leak" where the colors aren't darkened under the low-alpha pixels.

For this reason, I think pre-multiplying isn't appropriate for block compressors even if we get linear blending correct.

Friday, October 29, 2021

This One Weird Trick Lets You Beat the Experts

As I've mentioned in the past, one of my goals with this blog is to launch my post-development career in tech punditry and eventually author an O'Reilly book with an animal on the cover.* So today I'm going to add click-baity titles to my previous pithy quotes and tell you about that one weird trick that will let you beat the experts and write faster code than the very best libraries and system implementations.

But first, a Vox-style explanatory sub-heading.

StackOverflow is a Website That Destroys New Programmers' Hopes and Dreams

Here is a typical question you might see on Stack Overflow**:

I'm new to programming, but I have written a ten line program. How can I write my own custom allocator for my new game engine? I need it to be really fast and also thread safe.

StackOverflow is a community full of helpful veteran programmers who are there for the newbies among us, so someone writes an answer like this.***
Stop. Just stop. Unplug your keyboard and throw it into the nearest dumpster - it will help your game engine to do so. The very fact that you asked this question shows you should not be trying to do what you're doing.

You cannot beat the performance of the system allocator. It was written by a Linux programmer with an extraordinarily long beard. That programmer spent over 130 years studying allocation patterns. To write the allocator, he took a ten year vow of silence and wrote the allocator in a Tibetan monastery using the sand on the floor as his IDE. The resulting allocator is hand-optimized for x86, ARM, PDP-11 macro-assembler, and the Zilog Z-80. It uses non-deterministic finite state automata, phase-induced reality distortion fields, atomic linked lists made from enriched Thorium, and Heisenberg's uncertainty principle. It is capable of performing almost 35 allocations per second. You are not going to do better.
Discouraged, but significantly wiser, our young programmer realizes that his true calling in life is to simply glue together code other people have written, and vows to only develop electron apps from this point forward.

But what if I told you there was this one weird trick that would allow our newbie programmer to beat the system allocator's performance?

The Secret of My Success

Come closer and I will whisper to you the one word that has changed my life. This one thing will change how you code forever.  Here's what our young programmer needs to do to beat the system allocator:

Cheat.

Here, I'll show you how it's done.
#include <stddef.h>

static char s_buffer[1024];
void * cheat_malloc(size_t bytes) { return s_buffer; }
void cheat_free(void * block) { }
(Listing 1 - the world's worst allocator.)

One thing you cannot deny: this code is a lot faster than malloc.

Now you might be saying: "Ben, that is literally the worst allocator I have ever seen" (to which I say "Hold my beer") but you have to admit: it's not that bad if you don't need all of the other stuff malloc does (like getting blocks larger than 1K or being able to call it more than once or actually freeing memory). And really fast. Did I mention it was fast?

And here we get to the pithy quotes in this otherwise very, very silly post:

You might not need the full generality and feature set of a standard solution.
and
You can write a faster implementation than the experts if you don't solve the fully general problem.
In other words, the experts are playing a hard game - you win by cheating and playing tic tac toe while they're playing chess.

All That Stuff You Don't Need

A standard heap allocator like malloc does so much stuff. Just look at this requirements list.
  • You can allocate any size block you want. Big blocks! Small blocks! Larger than a VM page. Smaller than a cache line. Odd sized blocks, why not.
  • You can allocate and deallocate your blocks in any order you want. Tie your allocation pattern to a random number generator, it's fine, malloc's got your back.
  • You can free and allocate blocks from multiple threads concurrently, and you can release blocks on different threads than you allocated them from. Totally general concurrency, there are no bad uses of this API.
The reason I'm picking on these requirements is because they make the implementation of malloc complicated and slow.****

One of the most common ways to beat malloc that every game programmer knows is a bump allocator. (I've heard this called a Swedish allocator, and it probably has other names too.) The gag is pretty simple:
  1. You get a big block of memory and keep it around forever. You start with a pointer to the beginning.
  2. When someone wants memory, you move the pointer forward by the amount of memory they want and return the old pointer.
  3. There is no free operation.
  4. At the end of a frame, you reset the pointer to the beginning and recycle the entire block.
  5. If you want this on multiple threads, you make one of these per thread.
This is fast! Allocation is a single add, freeing is zero operations, and there are no locks. In fact, cache locality isn't bad either - successive allocations will be extremely close together in memory.
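
Here's a minimal sketch of one. The names, the block size, and the crude alignment are all just illustrative, and a per-thread version would make the statics thread_local:

#include <stddef.h>

#define BUMP_BLOCK_SIZE (16 * 1024 * 1024)

static char   s_block[BUMP_BLOCK_SIZE];   // one big block, kept around forever
static size_t s_used = 0;                 // the "bump pointer", as an offset

void * bump_alloc(size_t bytes)
{
    bytes = (bytes + 15) & ~(size_t)15;   // crude 16-byte alignment
    if (s_used + bytes > BUMP_BLOCK_SIZE)
        return NULL;                      // the frame out-grew the block - tune the size
    void * result = s_block + s_used;
    s_used += bytes;
    return result;
}

void bump_free(void * block) { }          // freeing is a no-op

void bump_reset(void) { s_used = 0; }     // call once at the end of the frame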

Our bump allocator is only slightly more complex than the world's worst allocator, but it shares a lot in common: it's faster because it doesn't provide all of those general purpose features that the heap allocator implements. If the bump allocator is too special purpose, you're out of luck, but if it fits your design needs, it's a win.

And it's simple enough you can write it yourself.

General Implementations are General

You see the same thing with networking. The conventional wisdom is: "use TCP, don't reinvent TCP", but when you look into specific domains where performance matters (e.g. games, media streaming) you find protocols that do roll their own, specifically because TCP comes with a bunch of overhead to provide its abstraction ("the wire never drops data") that is expensive and not needed for the problem space.

So let me close with when to cheat and roll your own. It makes sense to write your own implementation of something when:
  • You need better performance than a standard implementation will get you and
  • Your abstraction requirements are simpler or more peculiar than the fully general case and
  • You can use those different/limited requirements to write a faster implementation.
You might need a faster implementation of the fully general case - that's a different blog post, one for which you'll need a very long beard.



* The animal will be some kind of monkey flinging poop, obviously.

** Not an actual StackOverflow question.

*** Not an actual StackOverflow answer. You can tell it's not realistic because the first answer to any SO question is always a suggestion to use a different language.

**** I mean, not that slow - today's allocators are pretty good - and certainly better than I can write given those requirements. But compared to the kinds of allocators that don't solve those problems, malloc is slower.

Saturday, October 16, 2021

Is It Ever Okay to Future Proof?

A while ago I tried to make a case for solving domain-specific problems, rather than general ones. My basic view is that engineering is a limited resource, so providing an overly general solution takes away from making the useful specific one more awesome (or quicker/cheaper to develop).

Some people have brought up YAGNI, with opinions varying from "yeah, YAGNI is spot on, stop building things that will never get used" to "YAGNI is being abused as an excuse to just duct tape old code without ever paying off technical debt."

YAGNI is sometimes orthogonal to generalization. "Solving specific problems" can mean solving two smaller problems separately rather than trying to come up with one fused super-solution. In this case, there's no speculative engineering or future proofing, it's just a question of whether the sum can be less than the parts. (I think the answer is: Sometimes! It's worth examining as part of the design process.)

But sometimes YAGNI is a factor, e.g. the impulse to solve a general problem comes from an assumption that this general design will cover future needs too. Is that ever a good thing to assume?

Nostradamus the Project Manager

So: is it ever okay to future proof? Is it ever okay to have design requirements for a current engineering problem to help out with the next engineering problem?  Here are three tests to apply - if any of them fail, don't future proof.

  • Do you know for a fact that the future feature is actually needed by the business? If not, put your head down and code what needs to be done today.
  • Do you know exactly how you would solve the future feature efficiently with today's code?  If not, back away slowly.
  • Would the total engineering cost of both features be, in sum, smaller if you do some future proofing now?  If not, there's no win here.
Bear in mind, even if you pass all of these three tests, future proofing might still be a dumb idea. If you are going to grow the scope of the feature at all by future proofing, you had best check with your team about whether this makes sense. Maybe time is critical or there's a hard ship date. Keep it tight and don't slip schedule.

I find the three tests useful to cut off branches of exploration that aren't going to be useful. If my coworker is worrying that the design is too specific and can't stop playing "what if", these tests are good, because they help clarify how much we know about the future. If the answer is "not a ton", then waiting is the right choice.

And I think it's worth emphasizing why this works. When solving new coding problems, one of the ways we learn about the problem space and its requirements is by writing code. (We can do other things like create prototypes and simulations, test data, etc. but writing code absolutely is a valid way to understand a problem.)

The act of writing feature A doesn't just ship feature A - it's the R&D process by which you learn enough to write feature B. So fusing A & B's design may not be possible because you have to code A to learn about B. These questions highlight when features have this kind of "learning" data dependency.

Sometimes Ya Are Gonna Need It

With that in mind, we do sometimes future proof code in X-Plane. This almost always happens when we are designing an entire subsystem, we know its complete feature set for the market, but there is an advantage to shipping a subset of the features first. When I wrote X-Plane's particle system, we predicted (correctly) that people want to use the effects on scenery and aircraft, but we shipped aircraft first with a design that wouldn't need rewrites to work with scenery.

Given how strong the business case is for scenery effects, and given that we basically already know how this code would be implemented on its own (almost exactly the same as the aircraft code, but over here), it's very cheap to craft the aircraft code with this in mind.

But it requires a lot of really solid information about the future to make this win.

Monday, June 21, 2021

A Coroutine Can Be An Awaitable

This is part five of a five part series on coroutines in C++.

  1. Getting Past the Names
  2. Coroutines Look Like Factories
  3. co_await is the .then of Coroutines
  4. We Never Needed Stackful Coroutines
  5. A Coroutine Can Be An Awaitable

The relationship between awaitables and coroutines is a little bit complicated.

First, an awaitable is anything that a coroutine can "run after". So an awaitable could be an object like a mutex or an IO subsystem or an object representing an operation (I "awaited" my download object and ran later once the download was done).

I suppose you could say that an awaitable is an "event source" where the event is the wait being over. Or you could say events can be awaitables and it's a good fit.

A coroutine is something that at least partly "happens later" - it's the thing that does the awaiting. (You need a coroutine to do the awaiting because you need something that is code to execute. The awaitable might be an object, e.g. a noun.)

Where things get interesting is that coroutines can be awaitables, because coroutines (at least ones that don't infinite loop) have endings, and when they end, that's something that you can "run after". The event is "the coroutine is over."

To make your coroutine awaitable you need to do two things:

  1. Give it a co_await operator so the compiler understands how to build an awaitable object that talks to it and
  2. Come up with a reason to wake the caller back up again later.
Lewis Baker's cppcoro task is a near-canonical version of this.
  • The tasks start suspended, so they run when you co_await them.
  • They use their final_suspend awaitable to resume the coroutine that awaited them (the one that started them off).
Thus tasks are awaitables, and they are able to await (because they are coroutines), and they can be composed indefinitely.
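
Here's a rough sketch of that shape - not cppcoro's actual code, just the same idea boiled down (void return, no error handling, and every name here is made up):

#include <coroutine>
#include <utility>

struct task {
    struct promise_type {
        std::coroutine_handle<> continuation;   // whoever co_awaited us

        task get_return_object() {
            return task(std::coroutine_handle<promise_type>::from_promise(*this));
        }
        std::suspend_always initial_suspend() noexcept { return {}; }  // start suspended

        struct final_awaiter {
            bool await_ready() noexcept { return false; }
            std::coroutine_handle<> await_suspend(std::coroutine_handle<promise_type> me) noexcept {
                return me.promise().continuation;   // we're done: resume our awaiter
            }
            void await_resume() noexcept {}
        };
        final_awaiter final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };

    // co_await'ing the task stashes the awaiting coroutine and starts us running.
    auto operator co_await() noexcept {
        struct awaiter {
            std::coroutine_handle<promise_type> me;
            bool await_ready() noexcept { return false; }
            std::coroutine_handle<> await_suspend(std::coroutine_handle<> awaiting) noexcept {
                me.promise().continuation = awaiting;
                return me;                          // symmetric transfer into the task
            }
            void await_resume() noexcept {}
        };
        return awaiter{handle};
    }

    explicit task(std::coroutine_handle<promise_type> h) : handle(h) {}
    task(task && o) noexcept : handle(std::exchange(o.handle, {})) {}
    task(const task &) = delete;
    ~task() { if (handle) handle.destroy(); }

    std::coroutine_handle<promise_type> handle;
};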

While any coroutine can be an awaitable, it doesn't have to be. I built a "fire and forget" async coroutine that cleans itself up when it runs off the end - it's meant to be a top level coroutine that can be run from synchronous code, so it doesn't try to use coroutine tech to signal back to its caller. The actual C++ code in the coroutine needs to decide how to register its final results with the host app, maybe by calling some other non-coroutine execution mechanism.

Sunday, June 20, 2021

We Never Needed Stackful Coroutines

This is part four of a five part series on coroutines in C++.

  1. Getting Past the Names
  2. Coroutines Look Like Factories
  3. co_await is the .then of Coroutines
  4. We Never Needed Stackful Coroutines
  5. A Coroutine Can Be An Awaitable

You've probably seen libraries that provide a "cooperative threading" API, e.g. threads of execution that yield to each other at specific times, rather than when the kernel decides to. Windows Fibers, Boost Fibers, ucontext on Linux. I've been programming for a very long time, so I have written code for MacOS's cooperative "Thread Manager", back before cooperative multi-tasking was cool - at the time we were really annoyed that there was no pre-emption and that one bad worker thread would hang your whole machine. We were truly savages back then.

The gag for all of these "fiber" systems is the same - you have these 'threads' that only yield the CPU to another one via a yield API call - the yield call may name its successor or just pick any runnable non-running thread. Each thread has its own stack, thus our function (and its callstack and all state) is suspended mid-execution at the yield call.

Sounds a bit like coroutines, right?

There are two flavors of coroutines - stackful, and stackless. Fibers are stackful coroutines - the whole stack is saved to suspend the routine, allowing us to suspend at any point in execution anywhere.

C++20 coroutines are stackless, which means you can only suspend inside a coroutine itself - suspension depends on the compiler transforms of the coroutine (into a state machine object) to achieve the suspend.

When I first heard this, I thought: "well, performance is good but I wonder if I'll miss having full stack saves."

As it turns out, we don't need stack saves - here's why.

First, a coroutine is a function that has at least one potential suspend point. If there are zero potential suspend points, it's just a function, like from the 70s with the weed and the K&R syntax, and to state the obvious, we don't have to worry about a plain old function trying to suspend on us.

So if a coroutine calls a plain old function, the function can't suspend, so not being able to save the whole stack isn't important. We need only ask: "what happens if a coroutine calls a coroutine, and the child coroutine suspends?"

Well, we don't really "call" a child coroutine - we co_await it! And co_awaiting Just Works™.

  • When our parent coroutine co_awaits the child routine, the child coroutine (via the awaitable mediating this transfer of execution) gets a handle to the parent coroutine to stash for later. We can think of this as the "return address" from a stack function call, and the child coroutine can stash it somewhere in its promise storage.
  • When the child coroutine co_awaits (e.g. on I/O) this works as expected - we don't return to the parent coroutine because it's already suspended.
  • When the child coroutine finishes for real, its final suspend awaitable can return the parent coroutine handle (that we stashed) to transfer control back to the parent - this is the equivalent of using a RET opcode to pop the stack and jump to the return address.
In other words, our chain of coroutines forms a synthetic stack-like structure - the coroutine frames form a linked list (from children to parents, via the stashed coroutine handles), and each frame stores state for that function.
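
In code, using a hypothetical task type (think cppcoro::task, or the sketch in part five) and some_io as a stand-in for any awaitable I/O call, the "stack" behavior looks like this:

task child()
{
    co_await some_io();     // child suspends; the parent stays suspended too,
                            // because it is parked waiting on us
    // ... use the result ...
}                           // falling off the end hits final_suspend, which resumes
                            // the stashed parent handle - the coroutine "RET"

task parent()
{
    co_await child();       // the "call": the awaitable stashes parent's handle in
                            // child's promise, then resumes child
}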



Saturday, June 19, 2021

co_await is the .then of Coroutines

This is part three of a five part series on coroutines in C++.

  1. Getting Past the Names
  2. Coroutines Look Like Factories
  3. co_await is the .then of Coroutines
  4. We Never Needed Stackful Coroutines
  5. A Coroutine Can Be An Awaitable

One more intuition about coroutines: the co_await operator is to coroutines as .then (or some other continuation queueing routine) is to a callback/continuation/closure system.

In a continuation-based system, the fundamental thing we want to express is "happens after". We don't care when our code runs as long as it is after the stuff that must run before completes, and we'd rather not block or spin a thread to ensure that. Hence:

void dump_file(string path)
{
    future<string> my_data = file.async_load("~/stuff.txt");
    my_data.then([my_data](){
        printf("File contained: %s\n", my_data.get().c_str());
    });
}

The idea is that our file dumper will begin the async loading process and immediately return to the caller once we're IO bound. At some time later on some mystery thread (that's a separate rant) the IO will be done and our future will contain real data. At that point, our lambda runs and can use the data.

The ".then" API ensures that the file printing (which requires the data) happens after the file I/O.

With coroutines, co_await provides the same service.

async dump_file(string path)
{
    string my_data = co_await file.async_load("~/stuff.txt");
    printf("File contained: %s\n", my_data.c_str());
}

Just like before, the beginning of dump_file runs on the caller's thread and runs the code to begin the file load. Once we are I/O bound, we exit all the way back to the caller; some time later the rest of our coroutine will run (having access to the data) on a thread provided by the IO system.

Once we realize that co_await is the .then of coroutines, we can see that anything in our system with an async continuation callback could be an awaitable. A few possibilities:

  • APIs that move execution to different thread pools ("run on X thread")
  • Non-blocking I/O APIs that call a continuation once I/O is complete.
  • Objects that load their internals asynchronously and run a continuation once they are fully loaded.
  • Serial queues that guarantee only one continuation runs at a time to provide non-blocking async "mutex"-like behavior.
Given an API that takes a continuation, we can transform it into an awaitable by wrapping the API in an awaitable whose continuation "kicks" the suspended coroutine back awake.
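
For example, here's a rough sketch of wrapping a continuation-taking API in an awaitable - run_async is a made-up stand-in for whatever callback-based API you already have:

#include <coroutine>
#include <functional>

// Assumed, pre-existing callback API: does some work, then calls 'done'.
void run_async(std::function<void()> done);

struct run_async_awaitable {
    bool await_ready() { return false; }            // always suspend and go async
    void await_suspend(std::coroutine_handle<> h) {
        run_async([h]() { h.resume(); });           // the continuation just "kicks"
                                                    // the suspended coroutine awake
    }
    void await_resume() {}
};

// Usage inside a coroutine:
//     co_await run_async_awaitable{};
//     // ...this code runs after run_async's callback fires...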

Awaitables are also one of two places where you get to find out who any given coroutine is - await_suspend gives you the coroutine handle of the client coroutine awaiting on you, which means you can save it and put it in a FIFO or anywhere else for resumption later.  You also get to return a coroutine handle to switch to any other code execution path you want.

For example, a common pattern for a coroutine that computes something is:
  1. Require the client to co_await the coroutine to begin execution - the awaitable that does this saves the client's coroutine handle.
  2. Use a final_suspend awaitable to return to the client whose coroutine we stashed.
(This is what cppcoro's task does.)

The other place we can get a coroutine handle is in the constructor of our promise, which can use from_promise to find the underlying coroutine it is attached to and pass it to the return type, so the object the client holds is connected to its coroutine.

Friday, June 18, 2021

Coroutines Look Like Factories

This is part two of a five part series on coroutines in C++.

  1. Getting Past the Names
  2. Coroutines Look Like Factories
  3. co_await is the .then of Coroutines
  4. We Never Needed Stackful Coroutines
  5. A Coroutine Can Be An Awaitable

In my previous post I complained about the naming of the various parts of coroutines - the language design is great, but I find myself having to squint at the parts sometimes.

Before proceeding, a basic insight about how coroutines work in C++.

You write a coroutine like a function (or method or lambda), but the world uses it like a factory that returns a handle to your newly created coroutine.

One of the simplest coroutine types I was able to write was an immediately-running, immediately-returning coroutine that returns nothing to the caller - something like this:

#include <coroutine>

class async {
public:
    struct control_block {
        async get_return_object() { return async(); }   // the (useless) handle for the caller
        auto initial_suspend() { return std::suspend_never(); }
        auto final_suspend() noexcept { return std::suspend_never(); }
        void return_void() {}
        void unhandled_exception() { }
    };

    using promise_type = control_block;
};

The return type "async" returns a really useless handle to the client. It's useless because the coroutine starts on its own and ends on its own - it's "fire and forget".  The idea is to let you do stuff like this:

async fetch_file(string url, string path)
{
    string raw_data = co_await http::download_url(url);
    co_await disk::write_file_to_path(path, raw_data);
}

In this example, our coroutine suspends on IO twice, first to get data from the internet, then to write it to a disk, and then it's done.  Client code can do this:

void get_files(vector<pair<string,string>> stuff)
{
    for(auto spec : stuff)
    {
        fetch_file(spec.first, spec.second);
    }
}

To the client code, fetch_file is a "factory" that will create one coroutine for each file we want to get; that coroutine will start executing on the caller's thread (inside get_files), do enough work to start downloading, and then return.  We'll queue a bunch of network ops in a row.

How does the coroutine finish? The IO systems will resume our coroutine once data is provided. What thread is executing this code? I have no idea - that's up to the IO system's design. But it will happen after "fetch_file" is done.

Is this code terrible? So first, yes - I would say an API to do an operation with no way to see what happened is bad. 

But if legacy code is callback based, this pattern can be quite useful - code that launches callbacks typically puts the finalization of the operation in the callback and does nothing after launching it - the function is fire and forget because the end of the coroutine (or the callback) handles the results of the operation.