Monday, June 21, 2021

A Coroutine Can Be An Awaitable

This is part five of a five part series on coroutines in C++.

  1. Getting Past the Names
  2. Coroutines Look Like Factories
  3. co_await is the .then of Coroutines
  4. We Never Needed Stackful Coroutines
  5. A Coroutine Can Be An Awaitable

The relationship between awaitables and coroutines is a little bit complicated.

First, an awaitable is anything that a coroutine can "run after". So an awaitable could be an object like a mutex or an IO subsystem or an object representing an operation (I "awaited" my download object and ran later once the download was done).

I suppose you could say that an awaitable is an "event source" where the event is the wait being over. Or you could say events can be awaitables and it's a good fit.

A coroutine is something that at least partly "happens later" - it's the thing that does the awaiting. (You need a coroutine to do the awaiting because you need something that is code to execute. The awaitable might be an object, e.g. a noun.)

Where things get interesting is that coroutines can be awaitables, because coroutines (at least ones that don't infinite loop) have endings, and when they end, that's something that you can "run after". The event is "the coroutine is over."

To make your coroutine awaitable you need to do two things:

  1. Give it a co_await operator so the compiler understands how to build an awaitable object that talks to it and
  2. Come up with a reason to wake the caller back up again later.
Lewis Baker's cppcoro task is a near-canonical version of this.
  • The tasks start suspended, so they run when you co_await them.
  • They use their final_suspend object to resume the coroutine that awaited them to start them off.
Thus tasks are awaitables, and they are able to await (because they are croutines) and they can be composed indefinitely.

While any coroutine can be an awaitable, they might no be. I built a "fire and forget" async coroutine that cleans itself up when it runs off the end - it's meant to be a top level coroutine that can be run from synchronous code, thus it doesn't try to use coroutine tech to signal back to its caller. The actual C++ in the coroutine need to decide how to register their final results with the host app, maybe by calling some other non-coroutine execution mechanism.

Sunday, June 20, 2021

We Never Needed Stackfull Coroutines

This is part for of a five part series on coroutines in C++.

  1. Getting Past the Names
  2. Coroutines Look Like Factories
  3. co_await is the .then of Coroutines
  4. We Never Needed Stackful Coroutines
  5. A Coroutine Can Be An Awaitable

You've probably seen libraries that provide a "cooperative threading" API, e.g. threads of execution that yield to each other at specific times, rather than when the Kernel decides to. Windows Fibers, Boost Fibers, ucontext on Linux. I've been programming for a very long time, so I have written code for MacOS's cooperative "Thread Manager", back before cooperative multi-tasking was cool - at the time we were really annoyed that there was no pre-emption and that one bad worker thread would hang your whole machine. We were truly savages back then.

The gag for all of these "fiber" systems is the same - you have these 'threads' that only yield the CPU to another one via a yield API call - the yield call may name its successor or just pick any runnable non-running thread. Each thread has its own stack, thus our function (and its callstack and all state) is suspended mid-execution at the yield call.

Sounds a bit like coroutines, right?

There are two flavors of coroutines - stackful, and stackless. Fibers are stackful coroutines - the whole stack is saved to suspend the routine, allowing us to suspend at any point in execution anywhere.

C++20 coroutines are stackless, which means you can only suspend inside a coroutine itself - suspension depends on the compiler transforms of the coroutine (into a state machine object) to achieve the suspend.

When I first heard this, I thought: "well, performance is good but I wonder if I'll miss having full stack saves."

As it turns out, we don't need stack saves - here's why.

First, a coroutine is a function that has at least one potential suspend point. If there are zero potential suspend points, it's just a function, like from the 70s with the weed an the K&R syntax, and to state the obvious, we don't have to worry about a plain old function trying to suspend on us.

So if a coroutine calls a plain old function, the function can't suspend, so not being able to save the whole stack isn't important. We only need ask: "what happens if a coroutine calls a coroutine, and the child coroutine suspends?"

Well, we don't really "call" a child coroutine - we co_await it! And co_awaiting Just Works™.

  • When our parent coroutine co_awaits the child routine, the child coroutine (via the awaitable mediating this transfer of execution) gets a handle to the parent coroutine to stash for later. We can think of this as the "return address" from a stack function call, and the child coroutine can stash it somewhere in its promise storage.
  • When the child coroutine co-awaits (e.g. on I/O) this works as expected - we don't return to the parent coroutine because it's already suspended.
  • When the child coroutine finishes for real, its final suspend awaitable can return the parent coroutine handle (that we stashed) to transfer control back to the parent - this is the equivalent of using a RET opcode to pop the stack and jump to the return address.
In other words, our chain of co-routines forms a synthetic stack-like saving structure - the coroutine frames form a linked list (from children to parents) via the stashed handle to the coroutine) and each frame stores state for that function.



Saturday, June 19, 2021

co_await is the .then of Coroutines

This is part three of a five part series on coroutines in C++.

  1. Getting Past the Names
  2. Coroutines Look Like Factories
  3. co_await is the .then of Coroutines
  4. We Never Needed Stackful Coroutines
  5. A Coroutine Can Be An Awaitable

One more intuition about coroutines: the co_wait operator is to coroutines as .then (or some other continuation queueing routine) is to a callback/continuation/closure system.

In a continuation-based system, the fundamental thing we want to express is "happens after". We don't care when our code runs as long as it is after the stuff that must run before completes, and we'd rather not block or spin a thread to ensure that. Hence:

void dump_file(string path)

{

    future<string> my_data = file.async_load("~/stuff.txt");

    my_data.then([my_data](){

        printf("File contained: %s\n", my_data.get().c_str());

    });

}

The idea is that our file dumper will begin the async loading process and immediately return to the caller once we're IO bound. At some time later on some mystery thread (that's a separate rant) the IO will be done and our future will contain real data. At that point, our lambda runs and can use the data.

The ".then" API ensures that the file printing (which requires the data) happens after the file I/O.

With coroutines, co_await provides the same service.

async dump_file(string path)

{

    string my_data = co_await file.async_load("~/stuff.txt");

    printf("File contained: %s\n", my_data.c_str());

}

Just like before, the beginning of dump_file runs on the caller's thread and runs the code to begin the file load. Once we are I/O bound, we exit all the way back to the caller; some time later the rest of our coroutine will run (having access to the data) on a thread provided by the IO system.

Once we realize that co_await is the .then of coroutines, we can see that anything in our system with an async continuation callback could be an awaitable. A few possibilities:

  • APIs that move execution to different thread pools ("run on X thread")
  • Non-blocking I/O APIs that call a continuation once I/O is complete.
  • Objects that load their internals asynchronously and run a continuation once they are fully loaded.
  • Serial queues that guarantee only one continuation runs at a time to provide non-blocking async "mutex"-like behavior.
Given an API that takes a continuation, we can transform it into an awaitable by wrapping the API in an awaitable with a continuation that "kicks" the awaitable to resume the suspended coroutine.

Awaitables are also one of two places where you get to find out who any given coroutine is - await suspend gives you the coroutine handle of the client coroutine awaiting on you, which means you can save it and put it in a FIFO or anywhere else for resumption later.  You also get to return a coroutine handle to switch to any other code execution path you want.

For example, a common pattern for a coroutine that computes something is:
  1. Require the client to co_await the coroutine to begin execution - the awaitable that does this saves the client's coroutine handle.
  2. Use a final_suspend awaitable to return to the client whose coroutine we stashed.
(This is what cppcoro's task does.)

The other place we can get a coroutine handle is in the constructor of our promise, which can use from_promise to find the underlying coroutine it is attached to, and pass it to the return type, allowing handles to coroutines to connect to their coroutines.

Friday, June 18, 2021

Coroutines Look Like Factories

This is part two of a five part series on coroutines in C++.

  1. Getting Past the Names
  2. Coroutines Look Like Factories
  3. co_await is the .then of Coroutines
  4. We Never Needed Stackful Coroutines
  5. A Coroutine Can Be An Awaitable

In my previous post I complained about the naming of the various parts of coroutines - the language design is great, but I find myself having to squint at the parts sometimes.

Before proceeding, a basic insight about how coroutines work in C++.

You write a coroutine like a function (or method or lambda), but the world uses it like a factory that returns a handle to your newly created coroutine.

One of the simplest coroutine types I was able to write was an immediately-running, immediately-returning coroutine that returns nothing to the caller - something like this:

class async {

public:

    struct control_block {

        auto initial_suspend() { return suspend_never(); }

        auto final_suspend() { return suspend_never(); }

        void return_void() {}

        void unhandled_exception() { }

    };

    using promise_type = control_block;

};

The return type "async" returns a really useless handle to the client. It's useless because the coroutine starts on its own and ends on its own - it's "fire and forget".  The idea is to let you do stuff like this:

async fetch_file(string url, string path)

{

    string raw_data = co_await http::download_url(url);

    co_await disk::write_file_to_path(path, raw_data);

}

In this example, our coroutine suspends on IO twice, first to get data from the internet, then to write it to a disk, and then it's done.  Client code can do this:

void get_files(vector<pair<string,string>> stuff)

{

    for(auto spec : stuff)

    {

        fetch_file(spec.first,spec.second);
    }
}

To the client code, fetch_file is a "factory" that will create one coroutine for each file we want to get; that coroutine will start executing using the caller for get_files, do enough work to start downloading, and then return.  We'll queue a bunch of network ops in a row.

How does the coroutine finish? The IO systems will resume our coroutine once data is provided. What thread is executing this code? I have no idea - that's up to the IO system's design. But it will happen after "fetch_file" is done.

Is this code terrible? So first, yes - I would say an API to do an operation with no way to see what happened is bad. 

But if legacy code is callback based, this pattern can be quite useful - code that launches callbacks typically put the finalization of their operation in the callback and do nothing once launching the callback - the function is fire and forget because the end of the coroutine or callback handles the results of the operation.

Thursday, June 17, 2021

C++ Coroutines - Getting Past the Names

This is part one of a five part series on coroutines in C++.

  1. Getting Past the Names
  2. Coroutines Look Like Factories
  3. co_await is the .then of Coroutines
  4. We Never Needed Stackful Coroutines
  5. A Coroutine Can Be An Awaitable

I really like the C++20 coroutine language features - coroutines are a great way to write asynchronous code, and the language facilities give us a lot of power and client code that should be very manageable to write.

My one gripe is with the naming of a few of the various parts of the coroutine control structures. This shouldn't matter that much because only library writers have to care, but right now we're in the library wild west (which is fine by me - I'm old enough to be very skeptical of "the standard library should have everything" and happy to roll my own) so we can't avoid them.

A coroutine's overall behavior is mediated by its declared return type, which will typically be a struct or class from library code that looks something like this:

template<typename T>

struct task {

  struct task_promise_thingie {

    auto initial_suspend() { return std::suspend_always(); }

    ...

  };

  using promise_type = task_promise_thingie;

};

The inner struct (it could be separate but the nested structure expresses the relationship to client code in ar reasonable way I think) is called the "promise type" and ... man, do I hate that name. Partly because I consider the whole promise-future async model from C++11 to be bad, and partly because the promise type" does a lot of things, of which the promise might be the least important.

I'm not sure what I would have called the promise type, but I'd lean toward "traits" or "control block", because the promise type does a few key things for you:

  1. It defines the beginning and end of the lifecycle of your co-routine. Does the coroutine immediately start when created, or does it immediately suspend until someone kicks it? Does the coroutine run off the end and die on its own or wait until someone kills it off.
  2. It defines the flow control on the CPU at the beginning and end of the co-routine. Because the beginning and end of a coroutine's life are controlled by awaitables, you can launch an arbitrary coroutine at either point! For example, if the coroutine delivers its output to another coroutine, you can, at final suspend, resume that parent co-routine on whatever thread you ended on.
  3. It defines the "handle" type that client code will see when creating the coroutine.
This brings me to my second rant - the outer handle that client code gets when "calling" the coroutine (which is really constructing it) is called the "return type". This is harder to yell about because it is...a return type...but once again, I think this isn't the most important thing.

I would say the most important thing about the coroutine's return type is that it's the only access to the coroutine that client code gets. So if the client code that calls/builds the coroutine is going to have any further interaction with it, the return type is the "handle".  Left to my own byzantine naming instincts, I might have called it a "client handle" or something.

Saturday, February 27, 2021

Making SoA Tollerable

Chandler Caruth (I think - I can't for the life of me find the reference) said something in a cppcon talk years ago that blew my mind. More or less, 95% of code performance comes from the memory layout and memory access patters of data structures, and 5% comes from clever instruction selection and instruction stream optimization.

That is...terrible news! Instruction selection is now pretty much entirely automated. LLVM goes into my code and goes "ha ha ha foolish human with your integer divide by a constant, clearly you can multiply by this random bit sequence that was proven to be equivalent by a mathematician in the 80s" and my code gets faster. There's not much I have to worry about on this front.

The data structures story is so much worse. I say "I'd like to put these bytes here" and the compiler says "very good sir" in sort of a deferential English butler kind of way. I can sense that maybe there's some judgment and I've made bad life choices, but the compiler is just going to do what I told it. "Lobster Thermidor encrusted in Cool Ranch Doritos, very good sir" and Alfred walks off to leave me in a hell of L2 cache misses of my own design that turn my i-5 into a 486.

I view this as a fundamental design limitation of C++, one that might someday be fixed with generative meta-programming (that is, when we can program C++ to write our C++, we can program it to take our crappy OOPy-goopy data structures and reorganize them into something the cache likes) but that is the Glorious Future™. For now, the rest of this post is about what we can do about it with today's C++.

There Is Only Vector

To go faster, we have to keep the CPU busy, which means not waiting for memory. The first step is to use vector and stop using everything else - see the second half of Chandler's talk. Basically any data structure where the next thing we need isn't directly after the thing we just used is bad because the memory might not be in cache.

We experienced this first hand in X-Plane during the port to Vulkan. Once we moved from OpenGL to Vulkan, our CPU time in driver code went way down - 10x less driver time - and all of the remaining CPU time was in our own code. The clear culprit was the culling code, which walks a hierarchical bounding volume tree to decide what to draw.

I felt very clever when I wrote that bounding volume tree in 2005. It has great O(N) properties and lets us discard a lot of data very efficiently. So much winning!

But also, it's a tree. The nodes are almost never consecutive, and a VTune profile is just a sea of cache misses each time we jump nodes. It's slow because it runs at the speed of main memory.

We replaced it with a structure that would probably cause you to fail CS 102, algorithms and data structures:

1. A bunch of data is kept in an array for a a sub-section of the scenery region.

2. The sub-sections are in an array.

And that's it. It's a tree of fixed design of depth two and a virtually infinite node count.

And it screams. It's absurdly faster than the tree it replaces, because pretty much every time we have to iterate to our next thing, it's right there, in cache. The CPU is good at understanding arrays and is going to get the next cache line while we work. Glorious!

There are problems so big that you still need O(N) analysis, non-linear run-times, etc. If you're like me and have been doing this for a long time, the mental adjustment is how big N has to be to make that switch. If N is 100, that's not a big number anymore - put it in an array and blast through it.

We Have To Go Deeper

So far all we've done is replaced every STL container with vector. This is something that's easy to do for new code, so I would say it should be a style decision - default to vector and don't pick up sets/maps/lists/whatever unless you have a really, really, really good reason.

But it turns out vector's not that great either. It lines up our objects in a row, but it works on whole objects. If we have an object with a lot of data, some of which we touch all of the time and some of which we use once on leap years, we waste cache space on the rarely used data. Putting whole objects into an array makes our caches smaller, by filling them up with stuff we aren't going to use because it happens to be nearby.

Game developers are very familiar with what to do about it - perhaps less so in the C++ community: vector gives us an array of structures - each object is consecutive and then we get to the next object; what we really want is a structure of arrays - each member of the object is consecutive and then we hit the next object.

Imagine we have a shape object with a location, a color, a type, and a label. In the structure of arrays world, we store 4 shapes by storing: [(location1, location2, location3, location4), (color 1, color 2, color3, color4), (type 1, type2, type3, type 4), (label 1, label2, label3, label4)].

First, let's note how much better this is for the cache. When we go looking to see if a shape is on screen, all locations are packed together; every time we skip a shape, the next shape's location is next in memory. We have wasted no cache or memory bandwidth on thing we won't draw. If label drawing is turned off, we can ignore that entire block of memory. So much winning!

Second, let's note how absolutely miserable this is to maintain in C++. Approximately 100% of our tools for dealing with objects and encapsulations go out the window because we have taken our carefully encapsulated objects, cut out their gooey interiors and spread them all over the place. If you showed this code to an OOP guru they'd tell you you've lost your marbles. (Of coarse, SoA isn't object oriented design, it's data oriented design. The objects have been minced on purpose!)

Can We Make This Manageable?

So the problem I have been thinking about for a while now is: how do we minimize the maintenance pain of structures of arrays when we have to use them? X-Plane's user interface isn't so performance critical that I need to take my polymorphic hierarchy of UI widgets and cut it to bits, but the rendering engine has a bunch of places where moving to SoA is the optimization to improve performance.

The least bad C++ I have come up with so far looks something like this:

struct scenery_thingie {

    int            count;

    float *        cull_x;

    float *        cull_y;

    float *        cull_z;

    float *        cull_radius;

    gfx_mesh *     mesh_handle;


    void alloc(UTL_block_alloc * alloc, int count);

    scenery_thingie& operator++();

    scenery_thingie& operator+=(int offset);

};

You can almost squint at this and say "this is an object with five fields", and you can almost squint and this and say "this is an array" - it's both! The trick is that each member field is a base pointer into the first object (of count's) member field, with the next fields coming consecutively. While all cull_y fields don't have to follow cull_x in memory, it's nice if they do - we'd rather not have them on different VM pages, for example.

Our SoA struct can both be an array (in that it owns the memory and has the base pointers) but it can also be an iterator - the increment operator increments each of the base pointers. In fact, we can easily build a sub-array by increasing the base pointers and cutting the count, and iteration is just slicing off smaller sub-arrays in place - it's very cheap.

This turns out to be pretty manageable! We end up writing *iter.cull_x instead of iter->cull_x, but we more or less get to work with our data as expected.

Where Did the Memory Come From?

We have one problem left: where did the memory come from to allocate our SoA? We need a helper - something that will "organize" our dynamic memory request and set up our base pointers to the right locations. This code is doing what operator new[] would have done.

class UTL_block_alloc {

public:


    UTL_block_alloc();


    template<typename T>

    inline void alloc(T ** dest_ptr, size_t num_elements);


    void *    detach();

};

Our allocation block helper takes a bunch of requests for arrays of T's (e.g. arbitrary types) and allocates one big block that allocates them consecutively, filling in dest_ptr to point to each one. When we call detach, the single giant malloc() block is returned to be freed by client code.

We can feed any number of SoA arrays via a single alloc block, letting us pack an entire structure of arrays of structures into one consecutive memory region. With this tool, "alloc" of an SoA is pretty easy to write.

void scenery_thingie::alloc(UTL_block_alloc * a, int in_count)

{

    count = in_count;

    a->alloc(&cull_x,c);

    a->alloc(&cull_y,c);

    a->alloc(&cull_z,c);

    a->alloc(&cull_r,c);

    a->alloc(&mesh_handle,c);

}

A few things to note here:

  • The allocation helper is taking the sting out of memory layout by doing it dynamically at run-time. This is probably fine - the cost of the pointer math is trivial compared to actually going and getting memory from the OS.
  • When we iterate, we are using memory to find our data members. While there exists some math to find a given member at a given index, we are storing one pointer per member in the iterator instead of one pointer total.
One of these structs could be turned into something that looks more like a value type by owning its own memory, etc. but in our applications I have found that several SoAs tend to get grouped together into a bigger 'system', and letting the system own a single block is best. Since we have already opened the Pandora's box of manually managing our memory, we might as well group things complete and cut down allocator calls while getting better locality.

Someday We'll Have This

Someday we'll have meta-programing, and when we do, it would be amazing to make a "soa_vector" that, given a POD data type, generates something like this:

struct scenery_thingie {

    int            count;

    int            stride

    char *         base_ptr;

    float&         cull_x() { return (*(float *) base_ptr); }

    float&         cull_y() { return *((float *) base_ptr + 4 * stride); }

    float&         cull_z() { return *((float *) base_ptr + 8 * stride); }

    /* */

};


I haven't pursued this in our code because of the annoyance of having to write and maintain the offset-fetch macros by hand, as well as the obfuscation of what the intended data layout really is. I am sure this is possible now with TMP, but the cure would be worse than the disease. But generative meta-programming I think does promise this level of optimized implementation from relatively readable source code.

Nitty Gritty - When To Interleave

One last note - in my example, I split the X, Y and Z coordinates of my culling volume into their own arrays. Is this a good idea?  If it was a vec3 struct (with x,y,z members) what should we have done?

The answer is ... it depends? In our real code, X, Y and Z are separate for SIMD friendliness - a nice side effect of separating the coordinates is that we can load four objects into four lanes of a SIMD register and then perform the math for four objects at once. This is the biggest SIMD win we'll get - it is extremely cache efficient, we waste no time massaging the data into SIMD format, and we get 100% lane utilization. If you have a chance to go SIMD, separate the fields.

But this isn't necessarily best. If we had to make a calculation based on XYZ, together, and we always use them together and we're not going to SIMD them, it might make sense to pack them together (e..g so our data went XYZXYZXYZXYZ, etc.). This would mean fetching position would require only one stride in memory and not three. It's not bad to have things together in cache if we want them together in cache.