Wednesday, October 21, 2020

A Tip for HiZ SSR - Parametric 't' Tracing

HiZ tracing for screen-space reflections is an optimization where the search is done using a hierarchical Z-buffer (typically stashed in a mip chain) - the search can take larger skips when the low-res Z buffer indicates that there could not possibly be an occluder. This changes each trace from O(N) to O(log N).

The technique was published in GPU Pro 5, but as best I can tell, the author found out after writing the article that he couldn't post working sample code. The result is a tough chapter to make heads or tails of, because some parts of the algorithm simply say "see the sample code".  This forum thread is actually pretty useful, as is this.

The article builds a ray that is described parametrically in "Z" space, starting at the near clip plane (0.0 screen-space Z, must be a DX programmer ;-) and going out to 1.0.

If your app runs using reverse-float Z and you reverse this (starting the march at 1 and going in the -Z direction), you're going to get a ton of precision artifacts.  The reason: the march has its lowest floating-point precision at its start.  In reverse-float Z there's a lot of hyper-Z (screen-space) depth "wasted" around the near clip plane - that's okay because it's the 'good' part of our float range, which gets better with distance. But in our case, it's going to make our ray testing a mess.

The technique is also presented tracing only near occluders and tracing only away from the camera - this is fine if you view a far-away mountain reflected off a lake, but not if you view a building through a puddle from nearly straight down.

As it turns out, all of these techniques can be addressed with one restructure: parametrically tracing the ray using a parametric variable "t" instead of the actual Z buffer.

In this modification, the beginning of the ray cast is always at t = 0.0, and the end of the ray cast (a full Z unit away) is always at t = 1.0, regardless of whether that is in the positive or negative Z direction - the direction vector is normalized so that its Z magnitude is either +1.0 or -1.0, which saves us a divide when intersecting with Z planes.
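Concretely, the setup looks something like this - a sketch in C++ rather than shader code, with names of my own invention (ScreenRay, make_ray, t_at_z_plane):

```cpp
#include <cmath>

// Hypothetical sketch of the 't' parameterization described above.
struct ScreenRay {
    float ox, oy, oz;   // ray origin in screen space (z = hyper-Z depth)
    float dx, dy, dz;   // direction, scaled so |dz| == 1
};

// Scale a raw screen-space direction so the Z component has magnitude 1.
// After this, t measures "Z units traveled" and t = 0 is always the start.
ScreenRay make_ray(float ox, float oy, float oz,
                   float dx, float dy, float dz)
{
    float s = 1.0f / std::fabs(dz);   // assumes dz != 0
    return { ox, oy, oz, dx * s, dy * s, dz * s };
}

// Intersect with the plane z == zp. Because |dz| == 1, 1/dz equals dz,
// so the usual divide (zp - oz) / dz collapses to a multiply.
float t_at_z_plane(const ScreenRay& r, float zp)
{
    return (zp - r.oz) * r.dz;
}
```

Note that t_at_z_plane works identically for rays marching toward or away from the camera - that's the promised divide savings and direction-independence in one place.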

What does this solve?  A few things:

  1. The algorithm has high precision at the beginning of the trace because t has small magnitude - we want this for accurate tracing of close occluders and odd angles.
  2. The algorithm is fully general whether the ray is cast toward or away from the camera - no conditional code inside the search loop.  Lower 't' is always "sooner" in the ray cast.
  3. We can now march around an occluder if we can attribute a min and max depth to it, by seeing if the range of t within a search cell overlaps the range of t in our min-max Z buffer.  This is fully general for "in front" and "behind" a cast and for "toward" and "away".

I didn't see any mention of changing the parameterization in my web searching, nor did I see anyone else complaining about the precision loss from reverse-Z (it took me a good day or two to realize why my floating point was underperforming on that one), so I figured I'd put it out there.

Thursday, June 18, 2020

Second. Worst. Lock. Ever.

Four years ago I wrote a short post describing a dumb race condition in our reference counted art assets.

To recap the problem: X-Plane's art assets are immutable and reference counted, so they can be accessed lock-free (read-only) from rendering threads.

But X-Plane also has a lookup table from file path to art asset that is functionally a table of weak references; on creation of an art asset we use the table to discover whether we already loaded that asset and, if so, bump its count.  On destruction we have to grab the table lock to clear out the entry.

So version one of the code, which was really really bad, looked like this:

void object::release()
{
  if(m_count.decrement() == 0)
  {
    // Race goes here
    RAII_lock lock(m_table_lock);
    m_table.erase(this->name);
    delete this;
  }
}

This code is bad because after we decrement our reference count, but before we lock the table, another thread can go in, lock the table, find our art asset, increment its reference count and unlock the table - this would be caused by an async load of the same art asset (in another thread) hitting the "fast path".  We then delete a non-zero-count object.

The fix at the time was this:

void object::release()
{
  RAII_lock lock(m_table_lock);
  if(m_count.decrement() == 0)
  {
    m_table.erase(this->name);
    delete this;
  }
}

Since the table is locked before the decrement, no one can get in and grab our object - we block out all other asset loaders before we decrement; if we hit zero reference count, we take out the sledge hammer and trash the object.

Correct But Stupid

The problem with this new design is that it holds the table lock across every release operation - even ones where there is no chance of actually releasing the object.

We hold the table lock during asset creation - the API contract for loaders is that you get back a valid* C++ object representing the art asset when the creation API returns, so this effectively means we have to hold the lock so that a second thread loading the same object can't return the partially constructed object being built by a first thread.  This means the lock isn't a spin lock - it can be held across disk access for tens of milliseconds.

Well, that's not good. What happens when you put your object into a C++ smart handle that retains and releases the reference count in the constructor/destructor?

The answer is: you end up calling release all over the place and are constantly grabbing the table lock for one atomic op, and sometimes you're going to get stuck because someone else is doing real loading.
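For illustration, here's roughly the kind of smart handle in question - a minimal sketch with my own names (asset, asset_handle), not X-Plane's actual classes:

```cpp
#include <atomic>

// Stand-in for the real art asset: reference counted, self-deleting.
struct asset {
    std::atomic<int> m_count{1};
    void retain()  { m_count.fetch_add(1); }
    void release() { if (m_count.fetch_sub(1) == 1) delete this; }
};

// RAII handle: retain on copy, release on destruction.
class asset_handle {
public:
    explicit asset_handle(asset* a) : m_a(a) {}   // takes over one reference
    asset_handle(const asset_handle& o) : m_a(o.m_a) { if (m_a) m_a->retain(); }
    ~asset_handle() { if (m_a) m_a->release(); }
    asset_handle& operator=(const asset_handle&) = delete; // keep sketch short
    asset* get() const { return m_a; }
private:
    asset* m_a;
};
```

Every copy and every destruction of a handle calls release(); with the version-two design, each of those is a trip through the table lock.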

The reason this is a total fail is: client code would not expect that simply moving around ownership of the reference would be a "slow" operation the way true allocation/deallocation is.  If you say "I release an art asset on the main thread and the sim glitched" I tell you you're an idiot. If you say "my vector resized and I locked the sim for 100 ms", that's not a good API.

Third Time's a Charm

The heart of the bug is that we eat the expensive table lock when we release regardless of whether we need it.  So here's take three:

void object::release()
{
  if(m_count.decrement() == 0)
  {
    RAII_lock lock(m_table_lock);
    // If someone beat us to the table lock, check and abort.
    if(m_count.load() > 0)
      return;
    m_table.erase(this->name);
    delete this;
  }
}

This is sort of like double-checked locking: we do an early first check of the count to optimize out the table lock when it is obvious we aren't the deleter (our reference count is greater than zero after we decrement).  Once we take the table lock, we then re-check that no one beat us into the table between the decrement and the lock, and if we are still okay, we delete.

The win here is that we only take the table lock in the case where we are very likely to deallocate - and client code should only be hitting that case if the client code is prepared to deallocate a resource, which is never fast. With this design, as long as resource deallocation (at the client level) is in the background with resource creation, we never hit the table lock from any critical rendering path or incidental book-keeping.
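To make the shape of the final design concrete, here is a compilable sketch using a std::mutex and std::unordered_map in place of X-Plane's real table - the types and names (asset, acquire) are mine, not the shipping code:

```cpp
#include <atomic>
#include <mutex>
#include <string>
#include <unordered_map>

struct asset;
static std::mutex s_table_lock;
static std::unordered_map<std::string, asset*> s_table;

struct asset {
    std::string name;
    std::atomic<int> m_count;
    explicit asset(std::string n) : name(std::move(n)), m_count(1) {}

    void retain() { m_count.fetch_add(1); }

    // Take three: decrement first; take the table lock only on a
    // probable delete, then re-check the count under the lock.
    void release()
    {
        if (m_count.fetch_sub(1) == 1)      // we *probably* hit zero
        {
            std::lock_guard<std::mutex> lock(s_table_lock);
            if (m_count.load() > 0)         // someone revived us first
                return;
            s_table.erase(name);
            delete this;
        }
    }
};

// The loader "fast path": under the table lock, find an existing asset
// and bump its count before anyone can delete it out from under us.
asset* acquire(const std::string& name)
{
    std::lock_guard<std::mutex> lock(s_table_lock);
    auto it = s_table.find(name);
    if (it != s_table.end()) { it->second->retain(); return it->second; }
    asset* a = new asset(name);
    s_table.emplace(name, a);
    return a;
}
```

The retain in acquire() happens under the table lock, which is exactly what the re-check in release() is guarding against.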

* With the Vulkan renderer we now have art assets that complete some of their loading asynchronously - this is more or less mandatory because DMA transfers to the GPU are always asynchronous. So the synchronous part of loading is establishing a C++ object functional enough to "do useful stuff with it."

We could erode this synchronous load time further by making more functionality asynchronously available and less always-guaranteed.  But in practice it's not a real problem, because entire load operations on the sim itself are already in the background, so lowering load latency doesn't get us very much real win.

Thursday, April 16, 2020

Specular Hilites Have Their Own Depth

I just noticed this effect while I was brushing my teeth. The faucet in the sink is:

  • Chrome (or stainless steel) or some other reflective "metal" and
  • Reasonably glossy and
  • Kinda dirty.

As I looked down, I could see specular reflections from the lights above the mirror and I could see calcium deposits on the surface of the faucet itself.

And...then I noticed that they didn't have the same parallax.  Close one eye, then the other, and the relative placement of the specular hilites with regards to the calcium deposits changes.

In other words, the specular hilites are farther away than the surface they reflect in.

If you take a step back and think about this, it's completely logical. An image of yourself in a mirror appears twice as far away as the mirror itself, and if you draw an optical diagram, this is not surprising.  The eye-separation parallax is 'washed out' by twice as much distance along the full optical path as along the path to the surface of the mirror.
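As a quick small-angle sanity check of the "twice as far" intuition - all numbers below are illustrative, not measurements:

```cpp
#include <cmath>

// Small-angle parallax (radians) for a point at 'distance' meters,
// viewed with eyes 'eye_sep' meters apart.
double parallax(double eye_sep, double distance)
{
    return eye_sep / distance;
}

// Illustrative scenario: ~65 mm eye separation, faucet 0.5 m away,
// reflected light another 0.5 m of optical path beyond the surface.
// The calcium deposits parallax at 0.5 m; the specular image at the
// full 1.0 m path, so the hilite shows half the parallax of the
// surface - which the visual system reads as "farther away".
```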

Interestingly, X-Plane's VR system correctly simulates this. We do per-pixel lighting calculations separately on each eye using a camera origin (and thus transform stack and light vector) specific to each eye. When I first coded this, I thought it was odd that the lighting could be "inconsistent" between eyes, but it never looked bad or wrong.

Now I realize that that inconsistency is literally a depth cue.