Saturday, February 27, 2021

Making SoA Tolerable

Chandler Carruth (I think - I can't for the life of me find the reference) said something in a CppCon talk years ago that blew my mind. More or less, 95% of code performance comes from the memory layout and memory access patterns of data structures, and 5% comes from clever instruction selection and instruction stream optimization.

That is...terrible news! Instruction selection is now pretty much entirely automated. LLVM goes into my code and goes "ha ha ha foolish human with your integer divide by a constant, clearly you can multiply by this random bit sequence that was proven to be equivalent by a mathematician in the 80s" and my code gets faster. There's not much I have to worry about on this front.

The data structures story is so much worse. I say "I'd like to put these bytes here" and the compiler says "very good sir" in sort of a deferential English butler kind of way. I can sense that maybe there's some judgment and I've made bad life choices, but the compiler is just going to do what I told it. "Lobster Thermidor encrusted in Cool Ranch Doritos, very good sir" and Alfred walks off to leave me in a hell of L2 cache misses of my own design that turn my i-5 into a 486.

I view this as a fundamental design limitation of C++, one that might someday be fixed with generative meta-programming (that is, when we can program C++ to write our C++, we can program it to take our crappy OOPy-goopy data structures and reorganize them into something the cache likes) but that is the Glorious Future™. For now, the rest of this post is about what we can do about it with today's C++.

There Is Only Vector

To go faster, we have to keep the CPU busy, which means not waiting for memory. The first step is to use vector and stop using everything else - see the second half of Chandler's talk. Basically any data structure where the next thing we need isn't directly after the thing we just used is bad because the memory might not be in cache.

We experienced this first hand in X-Plane during the port to Vulkan. Once we moved from OpenGL to Vulkan, our CPU time in driver code went way down - 10x less driver time - and all of the remaining CPU time was in our own code. The clear culprit was the culling code, which walks a hierarchical bounding volume tree to decide what to draw.

I felt very clever when I wrote that bounding volume tree in 2005. It has great O(N) properties and lets us discard a lot of data very efficiently. So much winning!

But also, it's a tree. The nodes are almost never consecutive, and a VTune profile is just a sea of cache misses each time we jump nodes. It's slow because it runs at the speed of main memory.

We replaced it with a structure that would probably cause you to fail CS 102, algorithms and data structures:

1. A bunch of data is kept in an array for a sub-section of the scenery region.

2. The sub-sections are in an array.

And that's it. It's a tree with a fixed depth of two and a virtually unlimited node count.

And it screams. It's absurdly faster than the tree it replaces, because pretty much every time we have to iterate to our next thing, it's right there, in cache. The CPU is good at understanding arrays and is going to get the next cache line while we work. Glorious!
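Here's a minimal sketch of what that two-level layout can look like - the type and function names here are hypothetical, not X-Plane's actual code:

#include <vector>

// Hypothetical names - a sketch of the depth-two "tree", not X-Plane's code.
struct drawable {
    float x, y, z, radius;    // bounding sphere for culling
    int   mesh_id;            // what to draw if it survives culling
};

struct scenery_subsection {
    std::vector<drawable> items;                   // everything in one sub-region, contiguous
};

struct scenery_region {
    std::vector<scenery_subsection> subsections;   // the whole "tree" is two flat levels
};

bool is_on_screen(float x, float y, float z, float radius);   // frustum test, assumed elsewhere

// Culling is two nested linear walks - the next thing we need is almost
// always in the next cache line, so the prefetcher stays ahead of us.
void cull(const scenery_region& region, std::vector<int>& out_visible)
{
    for (const scenery_subsection& s : region.subsections)
        for (const drawable& d : s.items)
            if (is_on_screen(d.x, d.y, d.z, d.radius))
                out_visible.push_back(d.mesh_id);
}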

There are problems so big that you still need O(N) analysis, non-linear run-times, etc. If you're like me and have been doing this for a long time, the mental adjustment is how big N has to be to make that switch. If N is 100, that's not a big number anymore - put it in an array and blast through it.

We Have To Go Deeper

So far all we've done is replaced every STL container with vector. This is something that's easy to do for new code, so I would say it should be a style decision - default to vector and don't pick up sets/maps/lists/whatever unless you have a really, really, really good reason.

But it turns out vector's not that great either. It lines up our objects in a row, but it works on whole objects. If we have an object with a lot of data, some of which we touch all of the time and some of which we use once on leap years, we waste cache space on the rarely used data. Putting whole objects into an array effectively makes our caches smaller by filling them with stuff we aren't going to use just because it happens to be nearby.

Game developers are very familiar with what to do about it - perhaps less so in the C++ community: vector gives us an array of structures - each object is consecutive and then we get to the next object; what we really want is a structure of arrays - each member of the object is consecutive and then we hit the next object.

Imagine we have a shape object with a location, a color, a type, and a label. In the structure of arrays world, we store 4 shapes by storing: [(location1, location2, location3, location4), (color1, color2, color3, color4), (type1, type2, type3, type4), (label1, label2, label3, label4)].
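In code, the difference looks something like this - a hypothetical shape, not anything from X-Plane:

#include <string>
#include <vector>

struct vec3 { float x, y, z; };

// Array of structures - what vector gives us by default. Iterating over
// locations drags every shape's color, type, and label through the cache too.
struct shape {
    vec3        location;
    unsigned    color;
    int         type;
    std::string label;
};
std::vector<shape> shapes_aos;

// Structure of arrays - each member gets its own array. Iterating over
// locations touches only location memory.
struct shape_soa {
    std::vector<vec3>        location;
    std::vector<unsigned>    color;
    std::vector<int>         type;
    std::vector<std::string> label;
};
shape_soa shapes_soa;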

First, let's note how much better this is for the cache. When we go looking to see if a shape is on screen, all locations are packed together; every time we skip a shape, the next shape's location is next in memory. We have wasted no cache or memory bandwidth on things we won't draw. If label drawing is turned off, we can ignore that entire block of memory. So much winning!

Second, let's note how absolutely miserable this is to maintain in C++. Approximately 100% of our tools for dealing with objects and encapsulation go out the window because we have taken our carefully encapsulated objects, cut out their gooey interiors and spread them all over the place. If you showed this code to an OOP guru they'd tell you you've lost your marbles. (Of course, SoA isn't object oriented design, it's data oriented design. The objects have been minced on purpose!)

Can We Make This Manageable?

So the problem I have been thinking about for a while now is: how do we minimize the maintenance pain of structures of arrays when we have to use them? X-Plane's user interface isn't so performance critical that I need to take my polymorphic hierarchy of UI widgets and cut it to bits, but the rendering engine has a bunch of places where moving to SoA is the optimization to improve performance.

The least bad C++ I have come up with so far looks something like this:

struct scenery_thingie {
    int            count;
    float *        cull_x;
    float *        cull_y;
    float *        cull_z;
    float *        cull_radius;
    gfx_mesh *     mesh_handle;

    void alloc(UTL_block_alloc * alloc, int count);
    scenery_thingie& operator++();
    scenery_thingie& operator+=(int offset);
};

You can almost squint at this and say "this is an object with five fields", and you can almost squint at this and say "this is an array" - it's both! The trick is that each member field is a base pointer to that member of the first object, with the same member of the remaining count-1 objects following consecutively. While the cull_y fields don't have to follow the cull_x fields in memory, it's nice if they do - we'd rather not have them on different VM pages, for example.

Our SoA struct can both be an array (in that it owns the memory and has the base pointers) but it can also be an iterator - the increment operator increments each of the base pointers. In fact, we can easily build a sub-array by increasing the base pointers and cutting the count, and iteration is just slicing off smaller sub-arrays in place - it's very cheap.
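Here's what the increment operators might look like under this scheme - a sketch of the pattern, not necessarily the shipping code:

// Advancing the "iterator" just advances every base pointer in lock step and
// shrinks the remaining count - a sketch, not the shipping implementation.
scenery_thingie& scenery_thingie::operator+=(int offset)
{
    cull_x      += offset;
    cull_y      += offset;
    cull_z      += offset;
    cull_radius += offset;
    mesh_handle += offset;
    count       -= offset;
    return *this;
}

scenery_thingie& scenery_thingie::operator++()
{
    return *this += 1;
}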

This turns out to be pretty manageable! We end up writing *iter.cull_x instead of iter->cull_x, but we more or less get to work with our data as expected.

Where Did the Memory Come From?

We have one problem left: where did the memory come from to allocate our SoA? We need a helper - something that will "organize" our dynamic memory request and set up our base pointers to the right locations. This code is doing what operator new[] would have done.

class UTL_block_alloc {
public:

    UTL_block_alloc();

    template<typename T>
    inline void alloc(T ** dest_ptr, size_t num_elements);

    void *    detach();
};

Our allocation block helper takes a bunch of requests for arrays of T's (where T can be any type) and allocates one big block that lays them out consecutively, filling in each dest_ptr to point to its array. When we call detach, the single giant malloc() block is returned to be freed by client code.
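One plausible way to implement this - a sketch under the assumption that all pointer fix-ups can be deferred until detach, when the total size is known. This is not the real X-Plane class:

#include <cstddef>
#include <cstdlib>
#include <functional>
#include <vector>

// Sketch of the block allocator: record each request, then do one malloc in
// detach() and patch every destination pointer. Hypothetical implementation.
class UTL_block_alloc {
public:
    UTL_block_alloc() : m_total(0) {}

    template<typename T>
    void alloc(T ** dest_ptr, size_t num_elements)
    {
        // Round up so each sub-array is aligned for T.
        m_total = (m_total + alignof(T) - 1) & ~(alignof(T) - 1);
        size_t offset = m_total;
        m_total += num_elements * sizeof(T);
        m_fixups.push_back([dest_ptr, offset](char * base) {
            *dest_ptr = reinterpret_cast<T *>(base + offset);
        });
    }

    // One big allocation; client code owns (and eventually frees) the block.
    void * detach()
    {
        char * base = static_cast<char *>(std::malloc(m_total));
        for (auto& fix : m_fixups)
            fix(base);
        m_fixups.clear();
        m_total = 0;
        return base;
    }

private:
    size_t                                    m_total;
    std::vector<std::function<void(char *)>>  m_fixups;
};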

We can feed any number of SoA arrays via a single alloc block, letting us pack an entire structure of arrays of structures into one consecutive memory region. With this tool, "alloc" of an SoA is pretty easy to write.

void scenery_thingie::alloc(UTL_block_alloc * a, int in_count)
{
    count = in_count;
    a->alloc(&cull_x, in_count);
    a->alloc(&cull_y, in_count);
    a->alloc(&cull_z, in_count);
    a->alloc(&cull_radius, in_count);
    a->alloc(&mesh_handle, in_count);
}

A few things to note here:

  • The allocation helper is taking the sting out of memory layout by doing it dynamically at run-time. This is probably fine - the cost of the pointer math is trivial compared to actually going and getting memory from the OS.
  • When we iterate, we are using memory to find our data members. While there exists some math to find a given member at a given index, we are storing one pointer per member in the iterator instead of one pointer total.
  • One of these structs could be turned into something that looks more like a value type by owning its own memory, etc., but in our applications I have found that several SoAs tend to get grouped together into a bigger 'system', and letting the system own a single block is best. Since we have already opened the Pandora's box of manually managing our memory, we might as well group things completely and cut down allocator calls while getting better locality.

Someday We'll Have This

Someday we'll have meta-programming, and when we do, it would be amazing to make a "soa_vector" that, given a POD data type, generates something like this:

struct scenery_thingie {
    int            count;
    int            stride;
    char *         base_ptr;

    // Byte offsets: each member array is 'stride' elements long, so with
    // 4-byte floats cull_y starts 4*stride bytes after cull_x, and so on.
    float&         cull_x() { return *(float *) base_ptr;               }
    float&         cull_y() { return *(float *)(base_ptr + 4 * stride); }
    float&         cull_z() { return *(float *)(base_ptr + 8 * stride); }
    /* ... */
};


I haven't pursued this in our code because of the annoyance of having to write and maintain the offset-fetch macros by hand, as well as the obfuscation of what the intended data layout really is. I am sure this is possible now with TMP, but the cure would be worse than the disease. But generative meta-programming I think does promise this level of optimized implementation from relatively readable source code.

Nitty Gritty - When To Interleave

One last note - in my example, I split the X, Y and Z coordinates of my culling volume into their own arrays. Is this a good idea? If it were a vec3 struct (with x, y, z members), what should we have done?

The answer is ... it depends? In our real code, X, Y and Z are separate for SIMD friendliness - a nice side effect of separating the coordinates is that we can load four objects into four lanes of a SIMD register and then perform the math for four objects at once. This is the biggest SIMD win we'll get - it is extremely cache efficient, we waste no time massaging the data into SIMD format, and we get 100% lane utilization. If you have a chance to go SIMD, separate the fields.
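To illustrate the kind of thing the fully split layout makes trivial, here's a sphere-versus-plane test for four objects at once using SSE intrinsics - a sketch that assumes the SoA fields from the earlier struct, not the actual culling loop:

#include <immintrin.h>

// Test four bounding spheres against one plane in one shot - easy because
// cull_x / cull_y / cull_z / cull_radius are each contiguous. A sketch only.
// plane = (nx, ny, nz, d), with the normal pointing toward the visible half-space.
int four_spheres_visible(const float * cull_x, const float * cull_y,
                         const float * cull_z, const float * cull_radius,
                         const float plane[4])
{
    __m128 px = _mm_loadu_ps(cull_x);        // four x's straight out of the SoA
    __m128 py = _mm_loadu_ps(cull_y);
    __m128 pz = _mm_loadu_ps(cull_z);
    __m128 pr = _mm_loadu_ps(cull_radius);

    __m128 nx = _mm_set1_ps(plane[0]);
    __m128 ny = _mm_set1_ps(plane[1]);
    __m128 nz = _mm_set1_ps(plane[2]);
    __m128 d  = _mm_set1_ps(plane[3]);

    // signed distance = dot(n, p) + d; a sphere passes if distance >= -radius
    __m128 dist = _mm_add_ps(
        _mm_add_ps(_mm_mul_ps(nx, px), _mm_mul_ps(ny, py)),
        _mm_add_ps(_mm_mul_ps(nz, pz), d));
    __m128 pass = _mm_cmpge_ps(dist, _mm_sub_ps(_mm_setzero_ps(), pr));

    return _mm_movemask_ps(pass);            // bit i set = sphere i is on the visible side
}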

But this isn't necessarily best. If we had to make a calculation based on XYZ together, and we always use them together and we're not going to SIMD them, it might make sense to pack them together (e.g. so our data went XYZXYZXYZXYZ, etc.). This would mean fetching a position would require only one stride through memory and not three. It's not bad to have things together in cache if we want them together in cache.


Wednesday, October 21, 2020

A Tip for HiZ SSR - Parametric 't' Tracing

HiZ tracing for screen space reflections is an optimization where the search is done using a hierarchical Z-Buffer (typically stashed in a mip-chain) - the search can take larger skips when the low-res Z buffer indicates that there could not possibly be an occluder. This changes each trace from O(N) to O(log N).

The technique was published in GPU Pro 5, but as best I can tell, the author found out after writing the article that he couldn't post working sample code. The result is a tough chapter to make heads or tails of, because some parts of the algorithm simply say "see the sample code".  This forum thread is actually pretty useful, as is this.

The article builds a ray that is described parametrically in "Z" space, starting at the near clip plane (0.0 screen-space Z, must be a DX programmer ;-) and going out to 1.0.

If your app runs using reverse-float Z and you reverse this (starting the march at 1 and going in the -Z direction), you're going to get a ton of precision artifacts.  The reason: our march has the lowest precision at the start of the march.  In reverse-float Z there's a lot of hyper-Z (screen-space) depth "wasted" around the near clip plane - that's okay because it's the 'good' part of our float range, which gets better with distance. But in our case, it's going to make our ray testing a mess.

The technique is also presented tracing only near occluders and tracing only away from the camera - this is good if you view a far away mountain off a lake but not good if you view a building through a puddle from nearly straight down.

As it turns out, all of these techniques can be addressed with one restructure: parametrically tracing the ray using a parametric variable "t" instead of the actual Z buffer.

In this modification, the beginning of the ray cast is always at t = 0.0, and the end of the ray cast (a full Z unit away) is always at t = 1.0, regardless of whether that is in the positive or negative Z direction - the ray direction is normalized so that its Z component is either +1.0 or -1.0, which saves us a divide when intersecting with Z planes.
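Concretely, with the Z component forced to ±1, intersecting the ray with any Z plane is a multiply instead of a divide, and the overlap test against a min-max Z cell falls out naturally - this is a sketch of the idea, not the shader I shipped:

// Sketch of the 't' parameterization - not shipping shader code.
// The ray is P(t) = origin + t * dir, with dir scaled so |dir.z| == 1.

// t at which the ray crosses z == zp: (zp - origin.z) / dir.z, and since
// dir.z is +1 or -1, dividing by it is the same as multiplying by it.
float t_at_z(float origin_z, float dir_z /* +1 or -1 */, float zp)
{
    return (zp - origin_z) * dir_z;
}

// A cell's occluder (with min/max Z) can block the ray only if its t range
// overlaps the part of the ray that crosses the cell - the test works the same
// whether the ray marches toward or away from the camera.
bool t_ranges_overlap(float t_enter, float t_exit,   // ray's t span within the cell
                      float t_near,  float t_far)    // occluder's t span from min/max Z
{
    if (t_near > t_far) { float tmp = t_near; t_near = t_far; t_far = tmp; }
    return t_near <= t_exit && t_far >= t_enter;
}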

What does this solve?  A few things:

  1. The algorithm has high precision at the beginning of the trace because t has small magnitude - we want this for accurate tracing of close occluders and odd angles.
  2. The algorithm is fully general whether the ray is cast toward or away from the camera - no conditional code inside the search loop.  Lower 't' is always "sooner" in the ray cast.
  3. We can now march around an occluder if we can attribute a min and max depth to it, by seeing if the range of t within a search cell overlaps the range of t in our min-max Z buffer.  This is fully general for "in front" and "behind" a cast and for "toward" and "away".
I didn't see any mention of changing the parameterization in my web searching, nor did I see anyone else complaining about the precision loss from reverse-Z (it took me a good day or two to realize why my floating point was underperforming on that one) so I figured I'd put it out there.

Thursday, June 18, 2020

Second. Worst. Lock. Ever.

Four years ago I wrote a short post describing a dumb race condition in our reference counted art assets.

To recap the problem: X-Plane's art assets are immutable and reference counted, so they can be accessed lock-free (read-only) from rendering threads.

But X-Plane also has a lookup table from file path to art asset that is functionally a table of weak references; on creation of an art asset we use the table to see whether we already loaded that art asset, and if so, bump its count.  On destruction we have to grab the table lock to clear out the entry.
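For context, the acquire side looks roughly like this - a simplified sketch in the same pseudocode style as the snippets below, ignoring the cache-miss path where the asset actually gets loaded:

// Simplified sketch of the acquire "fast path" - the part that races with
// release(). Names (m_table, m_table_lock, m_count) match the pseudocode below.
object * find_retained(const std::string& path)
{
  RAII_lock lock(m_table_lock());
  auto i = m_table.find(path);
  if(i == m_table.end())
    return nullptr;                   // not loaded - caller does a real load
  i->second->m_count.increment();     // the increment that races with release()
  return i->second;
}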

So version one of the code, which was really really bad, looked like this:

void object::release()
{
  if(m_count.decrement() == 0)
  {
    // Race goes here
    RAII_lock (m_table_lock());
    m_table.erase(this->name);
    delete this;
  }
}
This code is bad because after we decrement our reference count, but before we lock the table, another thread can go in, lock the table, find our art asset, increment its reference count and unlock the table - this would be caused by an async load of the same art asset (in another thread) hitting the "fast path".  We then delete a non-zero-count object.

The fix at the time was this:

void object::release()
{
  RAII_lock (m_table_lock());
  if(m_count.decrement() == 0)
  {
    m_table.erase(this->name);
    delete this;
  }
}

Since the table is locked before the decrement, no one can get in and grab our object - we block out all other asset loaders before we decrement; if we hit zero reference count, we take out the sledge hammer and trash the object.

Correct But Stupid

The problem with this new design is that it holds the table lock across every release operation - even ones where there is no chance of actually releasing the object.

We hold the table lock during asset creation - the API contract for loaders is that you get back a valid* C++ object representing the art asset when the creation API returns, so this effectively means we have to hold the lock so that a second thread loading the same object can't return the partially constructed object being built by a first thread.  This means the lock isn't a spin lock - it can be held across disk access for tens of milliseconds.

Well, that's not good. What happens when you put your object into a C++ smart handle that retains and releases the reference count in the constructor/destructor?

The answer is: you end up calling release all over the place and are constantly grabbing the table lock for one atomic op, and sometimes you're going to get stuck because someone else is doing real loading.

The reason this is a total fail is: client code would not expect that simply moving around ownership of the reference would be a "slow" operation the way true allocation/deallocation is.  If you say "I release an art asset on the main thread and the sim glitched" I tell you you're an idiot. If you say "my vector resized and I locked the sim for 100 ms", that's not a good API.

Third Time's a Charm

The heart of the bug is that we eat the expensive table lock when we release regardless of whether we need it.  So here's take three:

void object::release()
{
  if(m_count.decrement() == 0)
  {
    RAII_lock (m_table_lock());
    // If someone beat us to the table lock, check and abort.
    if(m_count.load() > 0)
      return;
    m_table.erase(this->name);
    delete this;
  }
}

This is sort of like double-checked locking: we do an early first check of the count to optimize out the table lock when it is obvious we aren't the deleter (our reference count is greater than zero after we decrement).  Once we take the table lock, we then re-check that no one beat us into the table between the decrement and the lock, and if we are still okay, we delete.

The win here is that we only take the table lock in the case where we are very likely to deallocate - and client code should only be hitting that case if the client code is prepared to deallocate a resource, which is never fast. With this design, as long as resource deallocation (at the client level) is in the background with resource creation, we never hit the table lock from any critical rendering path or incidental book-keeping.

* With the Vulkan renderer we now have art assets that complete some of their loading asynchronously - this is more or less mandatory because DMA transfers to the GPU are always asynchronous. So the synchronous part of loading is establishing a C++ object functional enough to "do useful stuff with it."

We could start to erode this time by having more functionality be asynchronously available and less be always-guaranteed.  But in practice it's not a real problem because entire load operations on the sim itself are already in the background, so lowering load latency doesn't get us very much real win.

Thursday, April 16, 2020

Specular Hilites Have Their Own Depth

I just noticed this effect while I was brushing my teeth. The faucet in the sink is:

  • Chrome (or stainless steel) or some other reflective "metal" and
  • Reasonably glossy and
  • Kinda dirty.
As I looked down, I could see specular reflections from the lights above the mirror and I could see calcium deposits on the surface of the faucet itself.

And...then I noticed that they didn't have the same parallax.  Close one eye, then the other, and the relative placement of the specular hilites with regards to the calcium deposits changes.

In other words, the specular hilites are farther away than the surface they reflect in.

If you take a step back and think about this, it's completely logical. An image of yourself in a mirror appears twice as far away as the mirror itself, and if you draw an optical diagram, this is not surprising.  Twice as much eye-separation parallax is 'washed out' by distance along the full optical path as along the path to the surface of the mirror.

Interestingly, X-Plane's VR system correctly simulates this. We do per-pixel lighting calculations separately on each eye using a camera origin (and thus transform stack and light vector) specific to each eye. When I first coded this, I thought it was odd that the lighting could be "inconsistent" between eyes, but it never looked bad or wrong.

Now I realize that that inconsistency is literally a depth cue.


Friday, November 29, 2019

How to Make the OS X Terminal Change Colors for Remote Servers

A thing I have done is attempt to hard shut down my Mac using the shell (e.g. sudo shutdown -r) and accidentally taken down one of our servers that I was ssh'd into.

To fix this and generally have more awareness of what the heck I am doing, I use a .profile script to change the Terminal theme when logging into and out of remote servers via SSH.


echo BashRC

function tabc {
  NAME=$1; if [ -z "$NAME" ]; then NAME="Default"; fi
  osascript -e "tell application \"Terminal\" to set current settings of front window to settings set \"$NAME\""
}

function ssh {
  tabc "Pro"
  /usr/bin/ssh "$@"
  tabc "Basic"
}

I am not sure where this comes from - possibly here, but I see a few web references and I don't know which one I originally found. I'm posting it here so that I can find it the next time I get a new machine.

(Redoing this was necessary because I didn't migration-assistant my new laptop, and since it was going to Catalina, that was probably for the best.)

The script goes into ~/.profile for OS X 10.14 and older, or ~/.zprofile for OS X 10.15.

Tuesday, October 29, 2019

Klingons Do Not Release Software

Brent Simmons on ETAs for software releases:
This is all just to say that app-making is nothing like building a house. It’s more like building the first house ever in the history of houses, with a pile of rusty nails and your bare hands, in a non-stop tornado.
He put this at the end of the article, but I think it's the most important lesson here: it's hard to estimate how long a coding task will take because every coding task that you undertake that is worth something to your business is the very first time your business has made anything like that task.

This comes from the fundamentally reusable nature of software.  We see this with X-Plane.  We have decent estimates for how long aircraft will take to build because each aircraft, while different from the others in terms of its aerodynamics, look, texturing, and modeling, is the same in terms of the kinds of tasks involved: 3-d modeling, UV mapping, animation, etc.

On the software side, we do everything exactly once. Once we rewrite the entire rendering engine to run under Vulkan, we will never do that again, because we will have Vulkan support, and coding it twice would be silly.

Almost every software feature ends up like this - the feature is fundamentally novel compared to the last one because you don't need a second copy of code.

Brent finishes up the article with this:
The only reason anything ever ships is because people just keep working until it’s ready.
But this is only half-true. Code also ships when the scope of the project is reduced to the things you have already finished. And I think this is important to remember because reducing scope is the one thing that actually reduces ship times.

This reminded me of the old nerd joke:
Klingons do not release software. Klingon software escapes, leaving a bloody trail of design engineers and quality assurance testers in its wake.
On a big, unwieldy, messy release with seat-of-the-pants project management, that's a pretty good description of what happens - at some point the overwhelming business force of needing to ship dominates what's left in the todo pile, and the software smashes through the wall, Kool-Aid style, and gives the users a big "Oh Yeah!"

I think Brent's definition is right, as long as we recognize that "ready" is a dynamic term that can change based on market conditions and the better understanding of what you really need to ship that you get from implementation. If your app figures out it's ready before you do, you might have Klingon software on your hands.

Saturday, September 14, 2019

Hardware Performance: Old and New iPhones and PCs

I am looking at performance on our latest revision of X-Plane Mobile, and I've noticed that the performance bottlenecks have changed from past versions.

In the past, we have been limited by some of everything - vertex count, CPU-side costs (mostly in the non-graphics code, which is ported from desktop), and fill rate/shading costs. This time around, something has changed - performance is pretty much about shading costs, and that's the whole ballgame.

This is a huge relief! Reducing shading costs is a problem for which real-time graphics has a lot of good solutions. It's normally the problem, so pretty much all rendering techniques find ways to address it.

How did this shift happen? I think part of it is that Apple's mobile CPUs have just gotten really, really good. The iPhone X has a single-core Geekbench 4 score of 4245. By comparison, the 2019 iMac with an i5-8500 at 3 GHz gets 5187. That's just not a huge gap between a full-size, mid-range gaming desktop computer and...a phone.

In particular, we've noticed that bottlenecks that used to show up only on mobile have more or less gone away. My take on this is that Apple's mobile chips can now cope with not-hand-tuned code as well as Intel's can, e.g. less-than-perfect memory access patterns, data dependencies, etc. (A few years ago, code that was good enough on desktop that hand tuning wouldn't be a big win still needed some TLC for mobile. It's a huge dev-time win for that to no longer be the case.)

If I can summarize performance advice for today's mobile devices in a single item, it would be: "don't blend." The devices easily have enough ALU and texture bandwidth to run complex algorithms (like full PBR) at high res as long as you're not shading the full screen multiple times. Because the GPUs are tiling, the GPU will eliminate virtually any triangles that are not visible in the final product as long as blending is off.

By comparison, on desktop GPUs we find that utilization is often the biggest problem - that is, the GPU has the physical ALU and bandwidth to over-draw the scene multiple times if blending is left on, but the frame is made up of multiple smaller draw calls, and the GPU can't keep the entire card fed given the higher frequency of small jobs.

(We did try simplifying the shading environment - a smaller number of shaders doing more work by moving what used to be shader changes into conditional logic on the GPU. It was not a win! I think the problem is that if any part of a batch is changed, it's hard for the GPU to keep a lot of work in flight, even if the batch is "less different" than before optimization. Since we couldn't get down to "these batches are identical, merge them" we ended up with similarly poor utilization and higher ALU costs while shading.)