Tuesday, September 20, 2016

Finding Slow Frames With Instruments 8.0

There are two kinds of programmers: those that download a brand new version of X-Code on the very first day, and those that laugh at them when being an early adopter doesn't go well for them.

I am definitively in that second category; my view is that most software has gotten "good enough", so when a new version comes out, you pay a lot of disruption costs for not a lot of benefits. It's been years since a major OS update has shipped a feature I actually care about.

Yet the last two versions of X-Code actually have delivered new features compelling enough to make a tool-chain upgrade worth it. X-Code 7 delivered Address Sanitizer, which was a huge step forward in catching memory scribbles.

X-Code 8 has Thread Sanitizer, which I will blog about shortly, but this post is show and tell from the new version of Instruments, which finally has a clean way to annotate your timeline.*

Finding Dropped Frames

Adaptive Sampling Profilers are spectacular for finding performance problems, because by the definition of their sampling, they are better at finding the code that's taking the most CPU time. If you want to focus your effort on performance problems that really matter, an adaptive sampling profile will tell you where to aim.

There's one case though where an adaptive sampling profiler falls on its face: if you need to find a single "slow frame" in a sea of fast frames.

Imagine we have 300 consecutive frames, each taking about 15 ms. In other words, our game is running solidly at 60 fps.

Then we hit a "slow" frame - some part of the frame takes 60 ms (!), making that frame 4x as slow as the other ones. This is a visible "hitch" to the user and a bug.

But here's the problem: we have 4500 ms of good frames and 60 ms of bad frames. The bad frame represents 1.3% of the sample data. Even if the bad frame is made up of one single 45 ms function call that pushed us past our budget, that function is < 1% of the sample data and is going to be buried in the noise of non-critical every-frame functions (like the 1% we spent on the overlay UI).

Mark My Words

What we need is some kind of marker, showing us where in the timeline the bad frame occurred. Instruments 8 adds this with the "points of interest" track, which is basically user-data where you can mark what the app was doing.

Points of interest is part of the system trace; the system trace is a fairly heavy instrument, so unfortunately you have to use it in Windowed mode - that is, tell it to record the last N seconds, then bail out within N seconds of something happening. Fortunately on my laptop I can run with a 15 second window and the high frequency timed profiler and the machine is okay (barely). I would describe this performance as "usable" and not "amazing" - but then the Instruments team probably has access to some performance tools to make Instruments faster.

The newest Apple OSes have a framework to put debug markers into your app, but fortunately if you need to run on older tech, Apple has documented the secret spell to get a marker in legacy OS:

#if APL

syscall(SYS_kdebug_trace, APPSDBG_CODE(DBG_MACH_CHUD,0) | DBG_FUNC_NONE,0,0,0,0);
That puts a single marker into the timeline, and it is what I used to see my frame boundaries.

Here's a zoom-in of my timeline - the S in a circle are the frame boundaries - the short spans are solid fps and the long spans are the drops.
The middle blue track is VM activity - as you can see, the VM manager goes haywire every time we glitch.

I have an SSD and all I was doing was circling around a parked plane, so this was really weird. I took a look at the back-traces around the VM activity and the GLSL shader compiler was being run.

You Know That's Really Stupid, Right?

One of the purposes of this blog is to act as a confessional for dumb things I have coded; I try to bury these at the bottom of the post so no one sees them.

In this case, X-Plane has an inherited legacy that haunts us: on-the-fly shader compile. I will write another blog post about OpenGL and Vulkan and why this will never really be fixed until we go to a modern API that promises to not compile on the fly behind our backs, but for now the thing to note is that as you fly, X-Plane sometimes discovers that it hasn't compiled the terrain shader with the right options for the particular scenario it hits, which is a function of time of day and weather, the particular scenery, the rendering pass, and user rendering settings.

Here we can do even better with the points of interest: we can mark regions of interest like this:

syscall(SYS_kdebug_trace, APPSDBG_CODE(DBG_MACH_CHUD,1) | DBG_FUNC_START,0,0,0,0);
/* do really slow stuff */
syscall(SYS_kdebug_trace, APPSDBG_CODE(DBG_MACH_CHUD,1) | DBG_FUNC_END,0,0,0,0);
In the picture above, you can see the red bars at the very top - these mark the in-frame shader compiles, and they exactly fit the 'extra space' in the long frames. Problem child found!

What Was That?!?!

I did hit one dropped frame that did not have a shader compile in the middle - the sim went totally crazy with VM activity, and yet the time profile showed nothing unusual. It didn't make any sense.

So I switched to the "core" view, where the system trace shows you what each CPU is doing. I have an i7, so we can see 8 virtual cores. New to Instruments 8, the orange sections are where there are more runnable threads at reasonably high priority than there are cores - in other words, performance is degraded by running out of core count.
The play-head annotates what process is running over any given section, and the amazing thing about this trace is that as you scrub over all of those tiny blocks, it's almost never the same process twice. What happened?

Chris managed to psychically debug this: X-Plane dropped frames when I hit command-tab and switched out of X-Plane and back to Instruments to stop recording. Apparently every service in the universe on my machine is programmed to briefly wake up and reconsider its life choices when the foreground application changes.

The good news is that this kind of crazy activity was not at all present in the regular operation of the sim, and "sim drops frames when wildly changing between other apps" is not a bug I have to fix.

* Timeline markers in a profiler are not revolutionary at all. VTune has had this for at least longer than I have been using VTune. But even if a tool-chain feature is not new to the universe, it is still very useful when it is new to a platform if you are using that platform. VTune's ability to spot long frames two years ago was of no help to our iPhone app!

Thursday, September 08, 2016

You Can Never Have a Custom Z Buffer Distribution and Early Z at the Same Time

The distribution of precision in the default Z buffer configuration is not ideal. Precision is distributed via a 1 / x curve. This means more precision close to the camera, and less precision far away from it (which is good) but the curve goes asymptotic near the front, so it uses up too much precision. If you've had to try to tune your near clip plane, you know that bringing that near clip plane in just a little bit makes things much worse in the distance.

In order to get correct perspective rendering via a view frustum, our projection matrix has to put our eye-space Z coordinate into our clip space W coordinate. But we don't get to pick and choose - perspective divide is part of fixed function hardware, and all of X, Y and Z in clip space are divided by the same W. So if we are dividing by our eye-space Z for our screen location, we have to do so for depth as well. This is what distributes our depth precision.

We Have Shaders - We Can Do Better

Now that we have shaders, we have the power to mess with this scheme. Let's say we want our Z buffer to contain an eye-space linear distribution. No problem! At the end of our vertex shader, we just load whatever Z we want, then multiply by our clip-space W to pre-nullify the perspective divide. Perfect!

You can use this to have a logarithmic Z buffer, or you can use an eye-space Z buffer in floating point and in both cases you get really great distribution of depth precision. You can have a really, really close near clip plane and still have a huge depth range.

The Problem With Not Being Linear

All of these custom Z distributions have one thing in common: the clip space Z values distributed across a triangle are non-linear. That is, if we find a point half-way along the edge of a triangle, the Z value that our formula produces will not be the same as the Z value the hardware interpolator produces based on the two triangle corners.

When are you non-linear? Pretty much any time you do anything to your Z value that can't be described as Ax + By + Cz + D.  So for example, for eye-space linear, if we need to multiply by eye-space Z again, we're not linear.

The problem with not being linear is that the Z buffers written by the middle of a triangle may be very different from where it would have been if the triangle was tessellated. If you have a large-ish terrain triangle with a small decal like a rock on top of it, the middle of the large triangle might interpolate to be too close (or too far), causing Z errors.  If you have a mesh with layers that requires hidden surface removal and it isn't extremely uniform in vertex density, again, interior interpolation error can lead to Z artifacts. Any polygon that intersects the clip planes is also going to go somewhat haywire, as the clipping changes Z mid-triangle while the camera moves.

Let's Try The Fragment Shader

The above problems are almost certainly a show-stopper for any engine that shows artist-made 3-d meshes. The solution is to move the custom Z calculation to the fragment shader, so it can be done per pixel. This works great from a correctness standpoint - every pixel results in the right Z according to your encoding, but it has a nasty side effect: you lose early-Z.

Early Z is an optimization where your GPU runs the depth-stencil test before the fragment shader, instead of after. This is a huge win for any scene with significant hidden surfaces; fragment shading costs for hidden surfaces are completely removed, because the hardware doesn't even dispatch the fragments for shading. Without Early Z, your fragment shader runs, fetching textures, burning bandwidth, consuming ALU, and then throws the result out at the very end when it turns out that, oh noes, you've been drawing a secret base that's behind a closed door.

In order for early-Z to be effective, the Z coordinate out of the rasterizer has to be correct - that is, the fragment shader can't modify it. So any scheme that encodes a non-linear depth buffer in fragment shader defeats this.

The really bad news is: you don't just pay once. Now that your Z buffer is encoded with a special encoding, anything that is Z tested (even if it isn't Z-writing) has to calculate Z in the fragment shader as well. So for example, particle systems, clouds, and other fill-rate-intensive effects become more expensive.

In summary: the need to move a custom Z function to the fragment shader is caused by the very thing that gives it better depth distribution, and this defeats early Z. So you can never have early Z and a super depth buffer distribution at the same time.

There is one work-around I have seen that gets around this: use floating point depth and reverse the near and far encoding values; since floating point has more precision at 0.0, you are using the higher precision of float where we need it (where 1/Z is imprecise).

Split Depth for Now

X-Plane, like many games, uses two separate Z environments to cope with a very close view (inside the cockpit, where millimeters matter) and the world (where you can be millions of meters from the horizon). This isn't ideal - the cost of separating the Z passes isn't zero.

I wrote this post up after having worked out the math behind custom Z encodings (log Z, floating point eye-space, etc.); I had a todo item to investigate whether we could move X-Plane to a single Z pass and save some overhead.

The answer is, unfortunately, no for now. Losing early Z on our particle and cloud effects is a non-starter; we'd need to use a floating point depth buffer. Unfortunately, ARB_clip_control isn't wide-spread enough for us to count on. We'd also eat bandwidth in moving from a D24_S8 integer depth buffer to a D32F_S8 depth buffer (which pads out to 64 bits per depth sample.

One last note: X-Plane uses a floating point channel of the depth buffer to write floating point depth in eye space. While we split the actual Z buffer and draw in two passes, we build a single unified G-Buffer; our eye-space linear floating point depth has enough range to span both passes, and lets us then shade our G-Buffer in a single pass.

(When we run our two depth passes, we change only the near and far clip planes and nothing else; this means we can reuse our shadow maps and share our G-Buffer.  This takes most of the cost out of our two passes.)