Saturday, September 14, 2019

Hardware Performance: Old and New iPhones and PCs

I am looking at performance on our latest revision of X-Plane Mobile, and I've noticed that the performance bottlenecks have changed from past versions.

In the past, we have been limited by some of everything - vertex count, CPU-side costs (mostly in the non-graphics code, which is ported from desktop), and fill rate/shading costs. This time around, something has changed - performance is pretty much about shading costs, and that's the whole ballgame.

This is a huge relief! Reducing shading costs is a problem for which real-time graphics has a lot of good solutions. It's normally the problem, so pretty much all rendering techniques find ways to address it.

How did this shift happen? I think part of it is that Apple's mobile CPUs have just gotten really, really good. The iPhone X has a single-core GeekBench 4 score of 4245. By comparison, the 2019 iMac with an i5-8500 at 3 GHz gets 5187. That's just not a huge gap between a mid-range, full-size desktop computer and...a phone.

In particular, we've noticed that bottlenecks that used to show up only on mobile have more or less gone away. My take on this is that Apple's mobile chips can now cope with not-hand-tuned code as well as Intel's can, e.g. less-than-perfect memory access patterns, data dependencies, etc. (A few years ago, code that was good enough on desktop that hand tuning wouldn't be a big win would still need some TLC for mobile. It's a huge dev-time win that that's no longer the case.)

If I can summarize performance advice for today's mobile devices in a single item, it would be: "don't blend." The devices easily have enough ALU and texture bandwidth to run complex algorithms (like full PBR) at high res as long as you're not shading the full screen multiple times. Because the GPUs are tiling, the GPU will eliminate virtually any triangles that are not visible in the final product as long as blending is off.
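To make "don't blend" concrete: the win comes from letting the tiler see opaque geometry, so the practical engine-side move is to make sure everything with blending off draws first. Here's a minimal C++ sketch - DrawBatch and its fields are made up for illustration, not our real structures:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical draw batch - a real engine's structures will differ.
struct DrawBatch {
    bool blended;   // requires alpha blending?
    int  depth;     // coarse front-to-back sort key for opaque geometry
};

// Order batches so the tiler can reject hidden opaque triangles:
// all opaque batches first (front-to-back also helps non-tiling GPUs),
// blended batches last, in their original order.
void sort_for_tiler(std::vector<DrawBatch>& batches) {
    std::stable_sort(batches.begin(), batches.end(),
        [](const DrawBatch& a, const DrawBatch& b) {
            if (a.blended != b.blended) return !a.blended; // opaque first
            if (!a.blended) return a.depth < b.depth;      // front to back
            return false;                                  // keep blended order
        });
}
```
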

By comparison, on desktop GPUs we find that utilization is often the biggest problem - that is, the GPU has the physical ALU and bandwidth to over-draw the scene multiple times if blending is left on, but the frame is made up of multiple smaller draw calls, and the GPU can't keep the entire card fed given the higher frequency of small jobs.

(We did try simplifying the shading environment - a smaller number of shaders doing more work by moving what used to be shader changes into conditional logic on the GPU. It was not a win! I think the problem is that if any part of a batch is changed, it's hard for the GPU to keep a lot of work in flight, even if the batch is "less different" than before optimization. Since we couldn't get down to "these batches are identical, merge them" we ended up with similarly poor utilization and higher ALU costs while shading.)

Saturday, April 06, 2019

Keeping the Blue Side Up: Coordinate Conventions for OpenGL, Metal and Vulkan

OpenGL, Metal and Vulkan all have different ideas about which way is up - that is, where the origin is located and which way the Y axis goes for a framebuffer. This post explains the API differences and suggests a few ways to cope with them. I'm not going to cover the Z axis or Z-buffer here - perhaps that'll be a separate post.

Things We Can All Agree On

Let's start with some stuff that's the same for all three APIs: in all three APIs, the origin of a framebuffer and the origin of a texture both represent the lowest byte in memory for the data backing that image. In other words, memory addressing starts at 0,0 and then increases as we go to the right, then as we go to the next line. Whether we are in texels in a texture or pixels in a framebuffer, this relationship holds up.
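As a sanity check, the addressing rule that all three APIs share boils down to one line of math (assuming a tightly packed image with no row padding):

```cpp
#include <cstddef>

// Byte offset of texel (x, y) in a tightly packed image, matching the
// addressing all three APIs agree on: (0, 0) is the lowest byte in memory,
// x advances first, then y moves to the next row.
std::size_t texel_offset(std::size_t x, std::size_t y,
                         std::size_t width, std::size_t bytes_per_texel) {
    return (y * width + x) * bytes_per_texel;
}
```
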

This means that your model's UV maps and textures will Just Work™ in all three APIs. When your artist puts 0,0 into that giant fighting robot's UV map, the intent is "the texels at the beginning of memory for that texture."  You can load the image into the API the same way on all platforms and the UV map will pull out the right texels and the robot will look shiny.

All three APIs also agree on the definition of a clockwise or counterclockwise polygon - this decision is made in framebuffer coordinates, as a human would see it if presented to the screen. This works out well - if your robot model is drawing the way you expect, the windings are the way your artist created them, and you can keep your front-face definition consistent across APIs.

Refresher: Coordinate Systems

For the purpose of our APIs, we care about three coordinate systems:

  • Clip coordinates: these are the coordinates that come out of your shader. It's often easier to think in terms of normalized device coordinates (NDC) - the post-clip, post-perspective-divide coordinates - but you don't get to see them.
  • Framebuffer coordinates: these are the coordinates that are rasterized, after the NDC coordinates are transformed by the viewport transform.
  • Texture coordinates: these are the coordinates we feed into the samplers to read from textures. They're not that interesting because, per above, they work the same on all APIs.

OpenGL: Consistently Weird

OpenGL's conventions are different from approximately every other API ever, but at least they are self-consistent: every single origin in OpenGL is in the lower left corner of the image, so the +Y axis is always up. +Y is up in clip coordinates, NDC, and framebuffer coordinates.

What's weird about this is that every window manager ever uses +Y = down, so your OpenGL driver is prrrrrobably flipping the image for you when it sends it off to the compositor or whatever your OS has. But after 15+ years of writing OpenGL code, +Y=up now seems normal to me, and the consistency is nice. One rule works everywhere.*

In the OpenGL world, we render with the +Y axis being up, the high memory of the framebuffer is the top of the image, which is what the user sees, and if we render to texture, the higher texel coordinates are the top of the image, so everything is good. You basically can't mess this system up.

Metal: Up is Down and Down is Up

Metal's convention is to have +Y = up in clip coordinates (and NDC) but +Y = down in framebuffer coordinates, with the framebuffer origin in the upper left. While this is baffling to programmers coming from GL/GLES, it feels familiar to Direct3D programmers. In Metal, the viewport transformation has a built-in Y flip that you can't control at the API level.

The window manager presents Metal framebuffers with the lowest byte in the upper left, so if you go in with a model that transforms with +Y = up (OpenGL style), your image will come out right side up and all is good. But be warned, chaos lurks beneath the surface.

Metal's viewport and scissors are defined in framebuffer coordinates, so they now run +Y=down, and will require parameter adjustment to match OpenGL.
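The adjustment is a one-liner once you know the framebuffer height. A sketch - the Rect struct here is a stand-in for whatever viewport/scissor type your engine uses:

```cpp
// Hypothetical viewport/scissor rectangle - stands in for the parameters
// you'd hand to glViewport/glScissor or MTLViewport/MTLScissorRect.
struct Rect { int x, y, w, h; };

// Convert a lower-left-origin (OpenGL-style) viewport or scissor rect into
// Metal's upper-left-origin framebuffer coordinates. fb_height is the full
// height of the render target in pixels.
Rect gl_to_metal(const Rect& r, int fb_height) {
    return { r.x, fb_height - r.y - r.h, r.w, r.h };
}
```
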

Note also that our screenshot code that reads back the framebuffer will have to run differently on OpenGL and Metal; one (depending on your output image file format) will require an image flip, and one will not.

Render-to-Texture: Two Wrongs Make a "Ship It"

Here's the problem with Metal: let's say we draw a nice scene with blue sky up top and green grass on the bottom. We're going to use it as an environment input and sample it. Our OpenGL code expects that low texture coordinates (near 0) get us the green grass at the bottom of memory and high texture coordinates (near 1) get us the blue sky at the top of memory.

Unfortunately in the render-to-texture case, Metal's upper-left origin has been applied - the sky is now in low memory and the grass is in high memory, and our code that samples this image will show something upside-down and probably quite silly looking.

We have two options:
  1. Adjust the image at creation time by hacking the transform matrix, or
  2. Adjust the code that uses the image by adjusting the sampling coordinates.

For X-Plane, we picked door number 1 - intentionally render the image upside down (by Metal standards, or "the way it was meant to be" by OpenGL standards) so that the image is oriented as the samplers expect.

Why do it this way? Well, in our case, we often have shaders that sample both from images on disk and from rendered textures; if we flip our textures on disk (to match Metal's default framebuffer orientation) then we have to adjust every UV map that references a disk image, and that's a huge amount of code, because it covers all shaders and C++ code that generate UV maps. Focusing on render-to-texture is a smaller surface area to attack.

For Metal, we need to intentionally flip the Y coordinate by applying a Y-reverse to our transform stack - in our case this also meant ensuring that every shader used the transform stack; we had a few that were skipping it and had to be set up with identity transforms so the low level code could slip in the inversion.

We also need to change our front face's winding order, because winding orders are labeled in the API based on what a human would see if the image is presented to the screen. By mirroring our image to be upside down, we've also inverted all of our models' triangle windings, so we need to change our definition of what is in front.
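Both fixes are mechanical. Here's a sketch of the Y-flip half, assuming a column-major 4x4 matrix type (our real transform stack differs in the details); the winding half is just swapping your front-face enum whenever this flip is active:

```cpp
#include <array>

// Column-major 4x4 matrix - a hypothetical stand-in for the engine's type.
using Mat4 = std::array<float, 16>;

// Append a clip-space Y flip to a projection matrix: equivalent to
// left-multiplying by diag(1, -1, 1, 1). With column-major storage, that
// negates the second row, i.e. elements 1, 5, 9, and 13.
Mat4 flip_y(Mat4 m) {
    for (int col = 0; col < 4; ++col)
        m[col * 4 + 1] = -m[col * 4 + 1];
    return m;
}
```
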

Sampling with gl_FragCoord or [[position]]: Three Wrongs Make a "How Did I Get Here"?

There's one more loose end with Metal: if you wrote a shader that uses gl_FragCoord to reconstruct some kind of coordinates based on the window rasterization position, they're going to be upside-down from what your shader did in OpenGL.  The upper left of your framebuffer will rasterize with 0,0 for its position, and if you pass this on to a texture sampler, you're going to pick off low memory.

Had we left well enough alone, this would have been fine, as Metal wants to put the upper left of an image in low memory when rasterizing. But since we intentionally flipped things, we're now...upside down again.

Here we have two options:

  1. Don't actually flip the framebuffer when rendering to texture. Maybe that was a dumb idea.
  2. Insert code to flip the window coordinates.

For X-Plane we do both: some render targets are intentionally rasterized at API orientation (and not X-Plane's canonical lower-left-origin orientation) specifically so they can be resampled using window positions.  For example, we render a buffer that is sampled to get per-pixel fog, and we leave it at API orientation to get correct fogging.

Flipping the window coordinate in the sampled code makes sense when the window position is going to be used to reconstitute some kind of world-space coordinate system.  Our skydome, for example, is drawn as a full screen quad that calculates the ray that would project through the point in question. It takes as inputs the four corners of the view frustum, and swapping those in C++ fixes our sampling to match our upside down^H^H^H^H^H^HOpenGL-and-perfect-just-the-way-it-is image.
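The coordinate flip itself is trivial - this is the C++ equivalent of what the shader has to do before using the rasterization position to sample one of our lower-left-origin textures (in a real shader, fb_height would come in as a uniform):

```cpp
// Re-orient a Metal-style window-space y (0 at the top of the framebuffer)
// into an OpenGL-style y (0 at the bottom). Because rasterized positions
// land on pixel centers (0.5, 1.5, ...), subtracting from the full
// framebuffer height maps centers to centers.
float flip_window_y(float frag_y, float fb_height) {
    return fb_height - frag_y;
}
```
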

What Have We Learned (About Metal)

So to summarize, with Metal:

  • If we're going to render to a texture for a model, we put a Y-flip into our transform stack and swap our front face winding direction.
  • If we're going to render to a texture for sampling via window coordinates, we don't.
  • If we're going to use window coordinates to reconstruct 3-d, we have to swap the reconstruction coefficients.

Vulkan: What Would Spock Do?

Apparently: headstands! Vulkan's default coordinate orientation is +Y=down, full stop. The upper left of the framebuffer is the origin, and there's no inversion of the Y axis. This is consistent, but it's also consistently different from every other 3-d API ever, in that the Y axis in clip coordinates is backward from OpenGL, Metal, and DX.

The good news is: with Vulkan 1.1 you can specify a negative viewport height, which gives you a Y-axis swap. With this trick, Vulkan matches DX and Metal, and all you have to worry about is all of the craziness listed above.
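A sketch of the trick - I've mirrored VkViewport's fields in a local struct so this stands alone without the Vulkan headers:

```cpp
// Minimal struct mirroring VkViewport's fields (x, y, width, height,
// minDepth, maxDepth), so the sketch compiles without <vulkan/vulkan.h>.
struct Viewport { float x, y, width, height, minDepth, maxDepth; };

// Build a flipped viewport covering a fb_width x fb_height render target:
// anchor y at the bottom edge and pass a negative height, and Vulkan's
// clip space now has +Y = up, matching DX and Metal. Requires Vulkan 1.1
// or the VK_KHR_maintenance1 extension.
Viewport flipped_viewport(float fb_width, float fb_height) {
    return { 0.0f, fb_height, fb_width, -fb_height, 0.0f, 1.0f };
}
```
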

* A side effect of this is: when I built our table UI component, I defined the bottom-most row of the table as row 0. My co-workers incorrectly think this is a weird convention for UI, and one of them went as far as to write a table-row-flipping function.

The moral of the story is that +Y=up is both consistent and a great way to get out of being asked to maintain old and poorly thought out UI widgets.  I trust my co-worker will come around to my way of thinking in another fifteen years.

Wednesday, November 21, 2018

Code like You

Fantastic lightning talk from this year's CppCon:

This was what I was trying to get at in advocating for not going to the Nth degree with C++ complexity. We have finite cognitive budgets for detail, and the budget isn't super generous, so maybe don't burn it on Boost?

Saturday, August 11, 2018

Solve Less General Problems

Two decades ago when I first started working at Avid, one of the tasks I was assigned was porting our product (a consumer video editor - think iMovie before it was cool) from the PCI-card-based video capture we first shipped with to digital video via 1394/FireWire.

Being the fresh-out-of-school programmer that I was, I looked at this and said "what we need is a hardware abstraction layer!" I dutifully designed and wrote a HAL and parameterized more or less everything so that the product could potentially use any video input source we could come up with a plugin for.

This seemed at the time like really good design, and it did get the job done - we finished DV support.

After we shipped DV support, the product was canceled, I was moved to a different group, and the HAL was never used again.

In case it is not obvious from this story:

  • The decision to build a HAL was a totally stupid one. There was no indication in any of the product road maps that we had the legs to do a lot of video formats.
  • The fully generalized HAL design had a much larger scope than parameterizing only the stuff that actually had to change for DV.
  • We never were able to leverage any of the theoretical upsides of generalizing the problem.
  • I'm pretty embarrassed by the entire thing - especially the part where I told my engineering manager about how great this was going to be.

I would add to this that had the product not been canned, I'd bet a good bottle of scotch that the next hardware option to come along would have broken the abstraction (based only on the data points of PCI and DV video) and we would have had to rewrite the HAL anyway.

Plenty has been written by software developers about not 'future-proofing' a design speculatively. The short version is that it's more valuable to have a smaller design that's easy to refactor than a larger design with abstractions that you don't use; the abstractions are a maintenance tax.

It's Okay To Be Less General

One way I view my growth as a programmer over the last two decades is by tracking my becoming okay with being less general. At the time I wrote the HAL, if someone more senior had told me "just go special-case DV", I almost certainly would have explained how this was terrible design, and probably have gone and pouted about it if required to do the fast thing. I certainly wouldn't have appreciated the value to the business of getting the feature done in a fraction of the time.

On my next project I started learning from the school of hard knocks. I started with a templated data model ("hey, I'm going to reuse this and it'll be glorious") and about partway through recognized that I was being killed by an abstraction tax that wasn't paying me back. (At the time, templates tended to crash the compiler, so going fully templated was really expensive.) I made the right decision, after trying all of the other ones first - very American.

Being Less General Makes the Problem Solvable

I wrote about this previously, but Fedor Pikus is pretty much saying the same thing - in the very hard problem of lock-free programming, a fully general design might be impossible. Better to do something more specific to your design and have it actually work.

Here's another way to put this: every solution has strengths and weaknesses. You're better off with a solution where the weaknesses are the part of the solution you don't need.

Don't Solve Every Problem

Turns out Mike Acton is kind of saying the same thing. The mantra of the Data-Oriented-Design nerds is "know your data". The idea here is to solve the specific problem that your specific data presents. Don't come up with a general solution that works for your data and other data that your program will literally never see. General solutions are more expensive to develop and probably have down-sides you don't need to pay for.

Better to Not Leak

I haven't had a stupid pithy quote on the blog in a while, so here's some parental wisdom: it's better not to leak.
Prefer specific solutions that don't leak to general but leaky abstractions.
It can be hard to make a really general non-leaky abstraction. Better to solve a more specific problem and plug the leaks in the areas that really matter.

Wednesday, August 08, 2018

When Not To Serialize

Over a decade ago, I wrote a blog post about WorldEditor's file format design and, in particular, why I didn't reuse the serialization code from the undo system to write objects out to disk. The TL;DR version is that the undo system is a straight serialization of the in-memory objects, and I didn't want to tie the permanent file format on disk to the in-memory data model.

That was a good design decision. I have no regrets! The only problem is: the whole premise of the post is quite misleading because:

While WorldEditor does not use its in memory format as a direct representation of objects, it absolutely does use its in-memory linkage between objects to persist higher level data structures. And this turns out to be just as bad.

What Not To Do

WorldEditor's data model works more or less like this (simplified):

  • A document is made up of ... um ... things.
  • Every thing has an optional parent and zero or more ordered children, referred to by ID.
  • Higher level structures are made by trees of things.

For example, a polygon is a thing that has its contours as children (the first contour is the outer contour and the subsequent ones are holes). Contours, in turn, are things that have vertices as children, defining their path.
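A stripped-down sketch of that schema, with made-up names (the real WorldEditor types are more involved):

```cpp
#include <map>
#include <vector>

// Illustrative IDs and types - not WorldEditor's actual data model.
using ThingID = int;

struct Thing {
    ThingID              parent = -1;   // -1 means "no parent"
    std::vector<ThingID> children;      // ordered: e.g. a polygon's first
                                        // child is its outer contour
};

struct Document {
    std::map<ThingID, Thing> things;

    // Link child under parent, keeping both directions consistent.
    void add_child(ThingID parent, ThingID child) {
        things[parent].children.push_back(child);
        things[child].parent = parent;
    }
};
```

Note that in this schema there is nowhere for an edge to live - edges exist only implicitly, between consecutive vertex children of a contour.
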

In a WorldEditor document, a taxiway is a cluster of things with various IDs; to rebuild the full geometric information of a taxiway (which is a polygon) you need to use the parent-child relationships and look up the contours and vertices.

For WorldEditor, the in memory representation is exactly this schema, so the cost of building our polygons in our document is zero. We just build our objects and go home happy.

This seems like a win!  Until...

Changing the Data Structures

As it turns out, polygon has contours has vertices is a poor choice for an in-memory model of polygons. The big bug is: where are the edges??? In this model, edges are implicit - every pair of vertices defines one.

Things get ugly when we want to select edges.  WorldEditor's selection model is based on an arbitrary set of selected things. But this means that if it's not a thing, it can't be selected. Ergo: we can't select edges. This in turn makes the UI counter-intuitive. We have to go to mind-bending levels of awkwardness to pretend edges are selected when they are not.

The obvious thing to do would be to just add edges: introduce a new edge object that references its vertices, let the vertices reference adjacent edges, and go home happy.

This change would be relatively straightforward...until we go to load an old document and all of the edges are missing.

The Cost of Serializing Data Structures

The fail here is that we've serialized our data structures. And this means we have to parse legacy files in terms of those data structures to understand an old file at all. Let's look at all of the fail. To load an old file post-refactor we need to either:

  • Keep separate code around that can rebuild the old file structures into memory in their old form, so that we can then migrate those old in-memory structures into new ones. That's potentially a lot of old code that we probably hate - we wouldn't have rewritten it into a radically different form if we liked it.*
  • Alternatively, we can create a new data model that can exist with both the layout of the old and new data design. E.g. we can say that edges are optional and then "upgrade" the data model by adding them in when missing. But this sucks because it adds a lot of requirements to an in-memory data model that should probably be focused on performance and correctness.

And of course, the old file format you're dealing with was never designed - it's just whatever you had in memory, dumped out. That's not going to be a ton of fun to parse in the future.

When Not To Serialize

The moral equivalent of this problem (using the container structures that bind together objects as a file format spec) is dumping your data structures directly into a serializer (e.g. boost::serialize or some other non-brain-damaged serialization system) and calling it a "file format".

To be clear: serializing your data structures is totally fine as long as file-format stability over time is not a design goal. So for example, for undo state in WorldEditor this isn't a problem at all - undo records exist only per app run and don't have to interoperate between any instances of the app (let alone ones with code changes).

But if you need a file format that you will be able to continue to read after changing your code, serializing your containers is a poor choice, because the only way to read back the old data into the new code will be to create a shadow version of your data model (using those old containers) to get the data back, then migrate in memory.

Pay Now or Pay Later

My view is: writing code to translate your in-memory data structure from its native memory format to a persistent on disk format is a feature, not a bug: that translation code provides the flexibility to allow your in-memory and on disk data layouts change independently - when you need to change one and not the other, you can add logic to the translator. Serialization code (and the more automagic, the more so) binds these things together tightly. This is a problem when the file format and the in-memory format have different design goals, e.g. data longevity vs. performance tuning.

If you don't write that translation layer for the file format version 1, you'll have to write it for the file format version 2, and the thing you'll be translating won't be designed for longevity and sanity when parsing.
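For a flavor of what "writing the translation layer" means, here's a toy version - the struct, the keys, and the field names are all invented for illustration:

```cpp
#include <map>
#include <string>

// In-memory representation: free to change for performance.
struct Airplane {
    float wing_span_m;
    int   engine_count;
};

// Explicit translation to a key-value file format (illustrative keys).
// Because each field is written by hand, renaming or reordering struct
// members never changes the on-disk format, and old keys can be mapped
// to new fields when reading back old files.
std::map<std::string, std::string> to_file(const Airplane& a) {
    return {
        { "acf/wing_span_m",  std::to_string(a.wing_span_m) },
        { "acf/engine_count", std::to_string(a.engine_count) },
    };
}

Airplane from_file(const std::map<std::string, std::string>& kv) {
    Airplane a{};
    a.wing_span_m  = std::stof(kv.at("acf/wing_span_m"));
    a.engine_count = std::stoi(kv.at("acf/engine_count"));
    return a;
}
```
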

* We had to do this when migrating X-Plane 9's old binary file format (which was a memcpy of the actual airplane in memory) into X-Plane 10.  X-Plane 10 kept around a straight copy of all of the old C structs, fread() the data into place, and then copied it out field by field. Since we moved to a key-value-pair schema in X-Plane 10, things have been much easier.

Wednesday, June 06, 2018

Hats Off for a Fast Turn-Around

Normally I use this blog to complain about things that are broken, but I want to give credit here to Apple, for their WWDC 2018 video turn-around time. The first Metal session ended at 9 pm EDT and at 9:30 AM the next day it's already available for download. That's an incredible turn-around time for produced video, and it shows real commitment to making WWDC be for everyone and not just the attendees - last year we were at ~ 24 hour turn-around and it would have been easy to say "good enough" and take a pat on the back. My thanks to the team that had to stay up last night making this happen.

Monday, May 21, 2018

Never Map Again: Persistent Memory, System Memory, and Nothing In Between

I have, over the years, written more posts on VBOs and vertex performance in OpenGL than I'd like to admit. At this point, I can't even find them all. Vertex performance is often critical in X-Plane because we draw a lot of stuff in the world; at altitude you can see a lot of little things, and it's useful to be able to just blast all of the geometry through, if we can find a high performance vertex path.

It's 2018 and we've been rebuilding our internal engine around abstractions that can work on a modern GL driver, Vulkan and Metal. When it comes to streaming geometry, here's what I have found.

Be Persistent!

First, use persistent memory if you have it. On modern GL 4.x drivers on Windows/Linux, the driver can permanently map buffers for streaming via GL_ARB_buffer_storage. This is plan A! This will be the fastest path you can find because you pay no overhead for streaming geometry - you just write the data. (It's also multi-core friendly because you can grab a region of mapped memory without having to talk to the driver at all, avoiding multi-context hell.)

That persistent memory is a win is unsurprising - you can't get any faster than not doing any work at all, and persistent memory simply removes the driver from the equation by giving you a direct memory-centric way to talk to the GPU.

Don't Be Uncool

Second, if you don't have persistent memory (e.g. you are on OS X), use system memory via client arrays, rather than trying to jam your data into a VBO with glMapBuffer or glBufferSubData.

This second result surprised me, but in every test I've run, client arrays in system memory have out-performed VBOs for small-to-medium sized batches. We were already using system memory for small-batch vertex drawing, but it's even faster for larger buffers.

Now before you go and delete all your VBO code, a few caveats:

  • We are mostly testing small-batch draw performance - this is UI, some effects code, but not million-VBO terrain chunks.
  • The largest streaming data I have tried is a 128K index buffer. That's not tiny - that's perhaps 32 VM pages, but it's not a 2 MB static mesh.
  • It wouldn't shock me if index buffers are more friendly to system memory streaming than vertex buffers - the 128K index buffer indexes a static VBO.

Why Would Client Arrays Be Fast?

I'd speculate that they're easier to optimize.

Unlike VBOs, in the case of client arrays, the driver knows everything about the data transfer at one time. Everything up until an actual draw call is just stashing pointers for later use - the app is required to make sure the pointers remain valid until the draw call happens.

When the draw call happens, the driver knows:

  • How big the data is.
  • What format the data is in.
  • Which part of the data is actually consumed by the shader.
  • Where the data is located (system memory, duh).
  • That this is a streaming case - since the API provides no mechanism for efficient reuse, the driver might as well assume no reuse.

There's not really any validation to be done - if your client pointers point to junk memory, the driver can just segfault.

Because the driver knows how big the draw call is at the time it manages the vertex data, it can select the optimal vertex transfer mode for the particular hardware and draw call size. Large draws can be scheduled via a DMA (worth it if enough data is being transferred), medium draws can be sourced right from AGP memory, and tiny draws could even be stored directly in the command buffer.

You Are Out of Order

There's one last thing we know for client arrays that we don't know for map/unmap, and I think this might be the most important one of all: in the case of client arrays, vertex transfer is strictly FIFO - within a single context (and client arrays data is not shared) submission order from the client is draw/retirement order.

That means the driver can use a simple ring buffer to allocate memory for these draw calls. That's really cheap unless the total size of the ring buffer has to grow.
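A toy version of such a FIFO allocator, to show why it's cheap - this is speculation about driver internals, not actual driver code:

```cpp
#include <cstddef>

// Toy FIFO ring-buffer allocator of the kind a driver could use for
// client-array vertex data: allocations retire strictly in submission
// order, so freeing is just advancing the tail. (A real allocator would
// also handle a block straddling the end of the buffer; omitted here.)
class RingBuffer {
public:
    explicit RingBuffer(std::size_t size) : size_(size) {}

    // Returns the offset of a block of `bytes`, or -1 if the buffer is full.
    long alloc(std::size_t bytes) {
        if (used_ + bytes > size_) return -1;
        long off = static_cast<long>(head_);
        head_ = (head_ + bytes) % size_;
        used_ += bytes;
        return off;
    }

    // Retire the oldest allocation - FIFO order means no free list needed.
    void retire(std::size_t bytes) {
        tail_ = (tail_ + bytes) % size_;
        used_ -= bytes;
    }

private:
    std::size_t size_, head_ = 0, tail_ = 0, used_ = 0;
};
```
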

By comparison, the driver can assume nothing about orphaning and renaming of VBOs. Rename/map/unmap/draw sequences show up as ad hoc calls to the driver, so the driver has to allocate new backing storage for VBOs out of a free store/heap. Even if the driver has a magazine front-end, the cost of heap allocations in the driver is going to be more expensive than bumping ring buffer pointers.

What Can We Do With This Knowledge?

Once we recognize that we're going to draw only with client arrays and persistent memory (and not with non-persistent mapped and unmapped VBOs), we can make a simplifying assumption: our unmap/flushing overhead is zero in every case, and we can simplify client code around this.

In a previous post, I suggested two ways around the cost of ensuring that your data is GPU-visible: persistent memory and deferring all command encoding until later.

If we're not going to have to unmap, we can just go with option 1 all of the time. If we don't have persistent coherent memory, we treat system memory as our persistent coherent memory and draw with client arrays. This means we can drop the cost of buffering up and replaying our command encoding and just render directly.