Wednesday, November 21, 2018

Code like You

Fantastic lightning talk from this year's cppcon:

This was what I was trying to get at in advocating for not going to the Nth degree with C++ complexity. We have finite cognitive budgets for detail, and the budget isn't super generous, so maybe don't burn it on Boost?

Saturday, August 11, 2018

Solve Less General Problems

Two decades ago when I first started working at Avid, one of the tasks I was assigned was porting our product (a consumer video editor - think iMovie before it was cool) from the PCI-card-based video capture we first shipped with to digital video vie 1394/Firewire.

Being the fresh-out-of-school programmer was, I looked at this and said "what we need is a hardware abstraction layer!" I dutifully designed and wrote a HAL and parameterized more or less everything so that the product could potentially use any video input source we could come up with a plugin for.

This seemed at the time like really good design, and it did get the job done - we finished DV support.

After we shipped DV support, the product was canceled, I was moved to a different group, and the HAL was never used again.

In case it is not obvious from this story:

  • The decision to build a HAL was a totally stupid one. There was no indication in any of the product road maps that we had the legs to do a lot of video formats.
  • The fully generalized HAL design had a much larger scope than parameterizing only the stuff that actually had to change for DV.
  • We never were able to leverage any of the theoretical upsides of generalizing the problem.
  • I'm pretty embarrassed by the entire thing - especially the part where I told my engineering manager about how great this was going to be.
I would add to this that had the product not been canned, I'd bet a good bottle of scotch that the next hardware option that would have come along probably would have broken the abstraction (based only on the data points of PCI and DV video) and we would have had to rewrite the HAL anyway.

There's been plenty of written by lots of software developers about not 'future-proofing' a design speculatively. The short version is that it's more valuable to have a smaller design that's easy to refactor than to have a larger design with abstractions that you don't use; the abstractions are a maintenance tax.

It's Okay To Be Less General

One way I view my growth as a programmer over the last two decades is by tracking my becoming okay with being less general. At the time I wrote the HAL, if someone more senior had told me "just go special-case DV", I almost certainly would have explained how this was terrible design, and probably have gone and pouted about it if required to do the fast thing. I certainly wouldn't have appreciated the value to the business of getting the feature done in a fraction of the time.

In my next model I started learning from the school of hard knocks. I started with a templated data model ("hey, I'm going to reuse this and it'll be glorious") and about part way through recognized that I was being killed by an abstraction tax that wasn't paying me back. (At the time templates tended to crash the compiler, so going fully templated was really expensive.)  I made the right decision, after trying all of the other ones first - very American.

Being Less General Makes the Problem Solvable

I wrote about this previously, but Fedor Pikus is pretty much saying the same thing - in the very hard problem of lock-free programming, a fully general design might be impossible. Better to do something more specific to your design and have it actually work.

Here's another way to put this: every solution has strengths and weaknesses. You're better off with a solution where the weaknesses are the part of the solution you don't need.

Don't Solve Every Problem

Turns out Mike Acton is kind of saying the same thing. The mantra of the Data-Oriented-Design nerds is "know your data". The idea here is to solve the specific problem that your specific data presents. Don't come up with a general solution that works for your data and other data that your program will literally never see. General solutions are more expensive to develop and probably have down-sides you don't need to pay for.

Better to Not Leak

I haven't had a stupid pithy quote on the blog in a while, so here's some parental wisdom: it's better not to leak.
Prefer specific solutions that don't leak to general but leaky abstractions.
It can be hard to make a really general non-leaky abstraction. Better to solve a more specific problem and plug the leaks in the areas that really matter.

Wednesday, August 08, 2018

When Not To Serialize

Over a decade ago, I wrote a blog post about WorldEditor's file format design and, in particular, why I didn't reuse the serialization code from the undo system to write objects out to disk. The TL;DR version is that the undo system is a straight serialization of the in-memory objects, and I didn't want to tie the permanent file format on disk to the in-memory data model.

That was a good design decision. I have no regrets! The only problem is: the whole premise of the post is quite misleading because:

While WorldEditor does not use its in memory format as a direct representation of objects, it absolutely does use its in-memory linkage between objects to persist higher level data structures. And this turns out to be just as bad.

What Not To Do

WorldEditor's data model works more or less like this (simplified):

  • A document is made up of ... um ... things.
  • Every thing has an optional parent and zero or more ordered children, referred to by ID.
  • Higher level structures are made by trees of things.
For example, a polygon is a thing that has its contours as children (the first contour is the outer contour and the subsequent ones are holes). Contours, in turn, are things that have vertices as children, defining their path.

In a WorldEditor document, a taxiway is a cluster of things with various IDs; to rebuild the full geometric information of a taxiway (which is a polygon) you need to use the parent-child relationships and look up the contours and vertices.

For WorldEditor, the in memory representation is exactly this schema, so the cost of building our polygons in our document is zero. We just build our objects and go home happy.

This seems like a win!  Until...

Changing the Data Structures

As it turns out, polygon has contours has vertices is a poor choice for an in-memory model of polygons. The big bug is: where are the edges??? In this model, edges are implicit - every pair of vertices defines one.

Things get ugly when we want to select edges.  WorldEditor's selection model is based on an arbitrary set of selected things. But this means that if it's not a thing, it can't be selected. Ergo: we can't select edges. This in turn makes the UI counter-intuitive. We have to go to bind-bending levels of awkwardness to pretend edges are selected when they are not.

The obvious thing to do would be to just add edges: introduce a new edge object that references its vertices, let the vertices reference adjacent edges, and go home happy.

This change would be relatively straight forward...until we go to load an old document and all of the edges are missing.

The Cost of Serializing Data Structures

The fail here is that we've serialized our data structures. And this means we have to parse legacy files in terms of those data structures to understand an old file at all. Let's look at all of the fail. To load an old file post-refactor we need to either:

  • Keep separate code around that can rebuild the old file structures into memory in their old form, so that we can then migrate those old in-memory structures into new ones. That's potentially a lot of old code that we probably hate - we wouldn't have rewritten it into a radically different form if we liked it.*
  • Alternatively, we can create a new data model that can exist with both the layout of the old and new data design. E.g. we can say that edges are optional and then "upgrade" the data model by adding them in when missing. But this sucks because it adds a lot of requirements to an in-memory data model that should probably be focused on performance and correctness.
And of coarse, the old file format you're dealing with was never designed - it's just whatever you had in memory dumped out. That's not going to be a ton of fun to parse in the future.

When Not To Serialize

The moral equivalent of this problem (using the container structures that bind together objects as a file format spec) is dumping your data structures directly into a serializer (e.g. boost::serialize or some other non-brain-damaged serialization system) and calling it a "file format".

To be clear: serializing your data structures is totally fine as long as file format stability over time is not a design goal. So for example, for undo state in WorldEditor this isn't a problem at all - undoes exist only per app run and don't have to interoperate between any instance of the app (let alone ones with code changes).

But if you need a file format that you will be able to continue to read after changing your code, serializing your containers is a poor choice, because the only way to read back the old data into the new code will be to create a shadow version of your data model (using those old containers) to get the data back, then migrate in memory.

Pay Now or Pay Later

My view is: writing code to translate your in-memory data structure from its native memory format to a persistent on disk format is a feature, not a bug: that translation code provides the flexibility to allow your in-memory and on disk data layouts change independently - when you need to change one and not the other, you can add logic to the translator. Serialization code (and the more automagic, the more so) binds these things together tightly. This is a problem when the file format and the in-memory format have different design goals, e.g. data longevity vs. performance tuning.

If you don't write that translation layer for the file format version 1, you'll have to write it for the file format version 2, and the thing you'll be translating won't be designed for longevity and sanity when parsing.

* We had to do this when migrating X-Plane 9's old binary file format (which was a memcpy of the actual airplane in memory) into X-Plane 10.  X-Plane 10 kept around a straight copy of all of the C structs, fread() them into place, and then copied the data out field by field. Since we moved to a key-value pair schema in X-Plane 10, things have been much easier.

Wednesday, June 06, 2018

Hats Off for a Fast Turn-Around

Normally I use this blog to complain about things that are broken, but I want to give credit here to Apple, for their WWDC 2018 video turn-around time. The first Metal session ended at 9 pm EDT and at 9:30 AM the next day it's already available for download. That's an incredible turn-around time for produced video, and it shows real commitment to making WWDC be for everyone and not just the attendees - last year we were at ~ 24 hour turn-around and it would have been easy to say "good enough" and take a pat on the back. My thanks to the team that had to stay up last night making this happen.

Monday, May 21, 2018

Never Map Again: Persistent Memory, System Memory, and Nothing In Between

I have, over the years, written more posts on VBOs and vertex performance in OpenGL than I'd like to admit. At this point, I can't even find them all. Vertex performance is often critical in X-Plane because we draw a lot of stuff in the world; at altitude you can see a lot of little things, and it's useful to be able to just blast all of the geometry through, if we can find a high performance vertex path.

It's 2018 and we've been rebuilding our internal engine around abstractions that can work on a modern GL driver, Vulkan and Metal. When it comes to streaming geometry, here's what I have found.

Be Persistent!

First, use persistent memory if you have it. On modern GL 4.x drivers on Windows/Linux, the driver can permanently map buffers for streaming via GL_ARB_buffer_storage. This is plan A! This will be the fastest path you can find because you pay no overhead for streaming geometry - you just write the data. (It's also multi-core friendly because you can grab a region of mapped memory without having to talk to the driver at all, avoiding multi-context hell.)

That persistent memory is a win is unsurprising - you can't get any faster than not doing any work at all, and persistent memory simply removes the driver from the equation by giving you a direct memory-centric way to talk to the GPU.

Don't Be Uncool

Second, if you don't have persistent memory (e.g. you are on OS X), use system memory via client arrays, rather than trying to jam your data into a VBO with glMapBuffer or glBufferSubData.

This second result surprised me, but in every test I've run, client arrays in system memory have out-performed VBOs for small-to-medium sized batches. We were already using system memory for small-batch vertex drawing, but it's even faster for larger buffers.

Now before you go and delete all your VBO code, a few caveats:

  • We are mostly testing small-batch draw performance - this is UI, some effects code, but not million-VBO terrain chunks.
  • The largest streaming data I have tried is a 128K index buffer. That's not tiny - that's perhaps 32 VM pages, but it's not a 2 MB static mesh.
  • It wouldn't shock me if index buffers are more friendly to system memory streaming than vertex buffers - the 128K index buffer indexes a static VBO.

Why Would Client Arrays Be Fast?

I'd speculate, they're easier to optimize.

Unlike VBOs, in the case of client arrays, the driver knows everything about the data transfer at one time. Everything up until an actual draw call is just stashing pointers for later use - the app is required to make sure the pointers remain valid until the draw call happens.

When the draw call happens, the driver knows:

  • How big the data is.
  • What format the data is in.
  • Which part of the data is actually consumed by the shader.
  • Where the data is located (system memory, duh).
  • That this is a streaming case - since the API provides no mechanism for efficient reuse, the driver might as well assume no reuse.
There's not really any validation to be done - if your client pointers point to junk memory, the driver can just segfault.

Because the driver knows how big the draw call is at the time it manages the vertex data, it can select the optimal vertex transfer mode for the particular hardware and draw call size. Large draws can be scheduled via a DMA (worth it if enough data is being transferred), medium draws can be sourced right from AGP memory, and tiny draws could even be stored directly in the command buffer.

You Are Out of Order

There's one last thing we know for client arrays that we don't know for map/unmap, and I think this might be the most important one of all: in the case of client arrays, vertex transfer is strictly FIFO - within a single context (and client arrays data is not shared) submission order from the client is draw/retirement order.

That means the driver can use a simple ring buffer to allocate memory for these draw calls. That's really cheap unless the total size of the ring buffer has to grow.

By comparison, the driver can assume nothing about orphaning and renaming of VBOs. Rename/map/unmap/draw sequences show up as ad hoc calls to the driver, so the driver has to allocate new backing storage for VBOs out of a free store/heap. Even if the driver has a magazine front-end, the cost of heap allocations in the driver is going to be more expensive than bumping ring buffer pointers.

What Can We Do With This Knowledge?

Once we recognize that we're going to draw only with client arrays and persistent memory (and not with non-persistent mapped and unmapped VBOs), we can recognize a simplifying assumption: our unmap/flushing overhead is zero in every case, and we can simplify client code around this.

In a previous post, I suggested two ways around the cost of ensuring that your data is GPU-visible: persistent memory and deferring all command encoding until later.

If we're not going to have to unmap, we can just go with option 1 all of the time. If we don't have persistent coherent memory, we treat system memory as our persistent coherent memory and draw with client arrays. This means we can drop the cost of buffering up and replaying our command encoding and just render directly.

Tuesday, March 27, 2018

There Must Be Fifty Ways to Fail Your Stencil

When we first put deferred rendering into X-Plane 10.0, complete with lots of spot lights in-scene, I coded up stencil volumes for the lights in an attempt to save some shading power. The basic algorithm is:

  • Do a stencil pre-pass on all light volumes where:
    • The back of the volume failing increments. This happens when geometry is in front of the back of the light volume - this geometry might be lit!
    • The front of the volume failing decrements. This happens when geometry is in front of the front light volume and thus occludes (in screen space) anything that could have been lit.
  • Do a second pass with stencil testing for > 0. Only pixels with a positive count had geometry that were in between the two halves of the bounding volume, and thus light candidates.
This technique eliminates both fragments of occluded lights and fragments where the light shines through the air and hits nothing.

Typically the stencil modes are inc/dec with wrapping so that we aren't dependent on our volume fragments going out in any particular order - it all nets out.

We ended up not shipping this for 10.0 because it turned out the cure was worse than the disease - hitting the light geometry a second time hurt fps more than the fill savings for a product that was already outputting just silly amounts of geometry.

I made a note at the time that we could partition our lights and only stencil the ones in the first 200m from the camera - this would get all the fill heavy lights without drawing a ton of geometry.

I came back to this technique the other day, but something had changed: we'd ended up using a pile of our stencil bits for various random purposes, leaving very little behind for stencil volumes. We were down to 3 bits for our counter, and this was the result.

That big black void in between the lights in the center of the screen is where the number of overlapping non-occluded lights hitting a light-able surface hit exactly the wrap-around point in our stencil buffer - we got eight increments, wrapped to zero and the lights were stencil-tested out. The obvious way to cope with this is to use more than 3 stencil bits. :-)

I looked at whether there was something we could do in a single pass. Our default mode is to light with the back of our light volume, untested; the far clip plane is, well, far away, so we get good screen coverage.

I tried lighting with the front of the light volume, depth tested, so that cases where the light was occluded by intervening geometry would optimize out.  I used GL_ARB_depth_clamp to ensure that the front of my light volume would be drawn even if it hit the front clip plane.

It did not work! The problem is: since our view is a frustum, the side planes of the view volume cross at the camera location; thus if we are inside our light volume, the part behind us will be culled out despite depth clamp. This wasn't a problem for stencil volumes because they do the actual drawing off the back of the volume, and the front is just for optimization.

Monday, January 29, 2018

Flush Less Often

Here's a riddle:
Q: What does a two year old and OpenGL have in common? 
A: You never know when either of them is going to flush.*
In the case of a two-year old, you can wait a few years and he'll find different ways to not listen to you; unfortunately in the case of OpenGL, this is a problem of API design; therefore we have to use a fairly big hammer to fix it.

What IS the Flushing Problem?

Modern GPUs work by writing a command buffer (a list of binary encoded drawing instructions) to memory (using the CPU) and then "sending" that buffer to the GPU for execution, either by changing ownership of the shared memory or by a DMA copy.

Until that buffer of commands goes to the GPU, from the GPU's perspective, you haven't actually asked it to do anything - your command buffer is just sitting there, collecting dust, while the GPU is idle.

In modern APIs like Vulkan, Metal, and DX12, the command buffer is an object you build, and then you explicitly send it to the GPU with an API call.

With OpenGL, the command buffer is implicit - you never see it, it just gets generated as you make API calls. The command buffer is sent to the GPU ("flushed") under a few circumstances:
  1. If you ask GL to do so via glFlush.
  2. If you make a call that does an automatic flush (glFinish, glSwapBuffer, waiting on a sync with the flush bit).
  3. If the command buffer fills up due to you doing a lot of stuff.
This last case is the problematic one because it's completely unpredictable.

Why Do We Care?

Back in the day, we didn't care - you'd write commands and buffers would go out when they were full (ensuring a "goodly amount of work" gets sent to the GPU) and the last command buffer was sent when you swapped your back buffer.

But with modern OpenGL, calling the API is only a small fraction of the work we do; most of the work of drawing involves filling buffers with numbers. This is where your meshes and hopefully constant state are all coming from.

The flushing problem comes to haunt us when we want to draw a large number of small drawing batches. It's easy to end up with code like this:

// write some data to memory
// write some data to memory

Expanding this out, the code actually looks more like:

// map a buffer
// write to the buffer
// flush and unmap the buffer
// map a buffer
// write to the buffer
// flush and unmap the buffer

The problem is: even with glMapBufferRange and "unsynchronized" buffers, you still have to issue some kind of flush to your data before each drawing call.

The reason this is necessary is: glDrawElements might cause your command buffer to be sent to the GPU at any time! Therefore you have to have your data buffer completely flushed and ready to go after every drawing call.

How Do We Fix It?

You basically have two choices to make code like the above fast:

  1. If your are on a modern GL, use persistent coherent buffers. They don't need to be flushed - you can write data, call draw, and if the GL happens to send the command buffer down, your data is already visible. This is a great solution for UBOs on Windows.
  2. If you can't get persistent coherent buffers, defer all of your actual state and draw calls until every buffer has been built.

This second technique is a double-edged sword.

  • Win: it works every-where, even on the oldest OpenGL.
  • Win: as long as you're accumulating your state change, you can optimize out stupid stuff - handy when client code tends to produce crap OpenGL call-streams.
  • Lose: it does require you to marshal the entire API, so it's only good for code that sits on a fairly narrow foot-print.
For X-Plane, we actually intentionally choose not to use UBOs when persistent-coherent buffers are not also available. It turns out the cost of flushing per draw call is really bad, and our fallback path (loose uniforms) is actually surprisingly fast, because the driver guys have tuned the bejeezus out of that code path.

* My two-year old has figured out how to flush the toilet and thinks it's fascinating. What he hasn't figured out how to do is listen^H^H^H^H^Hwait until I'm done peeing. (And yes, non-parents, of coarse peeing is a group activity. Duh.)  The monologue went something like:

"Okay Ezra, wait until Daddy's done. No, not yet. It's too soon. Don't flush. Ezra?!  Sigh.  Wait, this is exactly like @#$@#$ glDrawElements!"

Saturday, January 13, 2018

Fixing Camera Shake on Single Precision GPUs

I've tried to write this post twice now, and I keep getting bogged down in the background information. In X-Plane 11.10 we fixed our long-standing problem of camera shake, caused by 32-bit floating point transforms in a very large world.

I did a literature search on this a few months ago and didn't find anything that met our requirements, namely:

  • Support GPUs without 64-bit floating point (e.g. mobile GPUs).
  • Keep our large (100 km x 100 km) mesh chunks.

I didn't find anything that met both of those requirements (the 32-bit-friendly solutions I found required major changes to how the engine deals with mesh chunks), so I want to write up what we did.

Background: Why We Jitter

X-Plane's world is large - scenery tiles are about 100 km x 100 km, so you can be up to 50 km from the origin before we "scroll" (e.g. change the relationship between the Earth and the primary rendering coordinate system so the user's aircraft is closer to the origin).  At these distances, we have about 1 cm of precision in our 32-bit coordinates, so any time we are close enough to the ground that 1 cm is larger than 1 pixel, meshes will "jump" by more than 1 pixel during camera movement due to rounding in the floating point transform stack.

It's not hard to have 1 pixel be larger than 1 cm. If you are looking at the ground on a 1920p monitor, you might have 1920 pixels covering 2 meters, for about 1 mm per pixel.  The ground is going to jitter like hell.

Engines that don't have huge offsets don't have these problems - if we were within 1 km of the origin, we'd have almost 100x more precision and the jitter might not be noticeable. Engines can solve this by having small worlds, or by scrolling the origin a lot more often.

Note that it's not good enough to just keep the main OpenGL origin near the user. If we have a large mesh (e.g. a mesh whose vertices get up into the 50 km magnitude) we're going to jitter, because at the time that we draw them our effective transform matrix is going to need an offset to bring the 50 km offset back to the camera.  (In other words, even if our main transform matrix doesn't have huge offsets that cause us to lose precision, we'll have to do a big translation to draw our big object.)

Fixing Instances With Large Offsets

The first thing we do is make our transform stack double precision on the CPU (but not the GPU). To be clear, we need double precision:
  • In the internal transform matrices we keep on the CPU as we "accumulate" rotates, translates, etc.
  • In the calculations where we modify this matrix (e.g. if we are going to transform, we have to up-res the incoming matrix, do the calculation in double, and save the results in double).
  • We do not have to send the final transforms to the GPU in double - we can truncate the final model-view, etc.
  • We can accept input transforms from client code in single or double precision.
This will fix all jitter caused by objects with small offset meshes that are positioned far from the origin.  Eg. if our code goes: push, translate (large offset), rotate (pose), draw, pop, then this fix alone gets rid of jitter on that model, and it doesn't require any changes to the engine or shader.

We do eat the cost of double precision in our CPU-side transforms - I don't have numbers yet for how much of a penalty on old mobile phones this is, but on desktop this is not a problem. (If you are beating the transform stack so badly that this matters, it's time to use hardware instancing.)

This hasn't fixed most of our jitter - large meshes and hardware instances are still jittering like crazy, but this is a necessary pre-requisite.

Fixing Large Meshes

The trick to fixing jitter on meshes with large vertex coordinates is understanding why we have precision problems.  The fundamental problem is this: transform matrices apply rotations first and translations second. Therefore in any model-view matrix that positions the world, the translations in the matrix have been mutated by the rotation basis vectors. (That's why your camera location is not just items 12,13, and 14 of your MV matrix.)

If the camera's location in the world is a very big number (necessary to get you "near" those huge-coordinate vertices so you can see them) then the precision at which they are transformed by the basis vectors is...not very good.

That's not actually the total problem. (If it was, preparing the camera transform matrix in double on the CPU would have gotten us out of jail.)

The problem is that we are counting on these calculations to cancel each other out:

vertex location * camera rotation + (camera rotation * camera location) = eye space vertex

The camera rotated location was calculated on the CPU ahead of time and baked into the translation component of your MV matrix ,but the vertex location is huge and is rotated by the camera rotation on the GPU in 32-bits.  So we have two huge offsets multiplied by very finicky rotations - we add them together and we are hoping that the result is pixel accurate, so that tiny movements of the camera are smooth.

They are not - it's the rounding error of the cancelation of these calculations that is our jitter.

The solution is to change the order of operations of our transform stack. We need to introduce a second translation step that (unlike a normal 4x4 matrix operation), happens before rotation, in world coordinates and not camera coordinates.  In other words, we want to do this:

(vertex location - offset) * camera rotation + (camera rotation * (camera location - offset)) = ...

Heres' why this can actually work: "offset" is going to be a number that brings our mesh roughly near the camera. Since it doesn't have to bring us to the camera, it can change infrequently and have very few low-place bits to get lost by rounding.  Since our vertex location and offset are not changing, this number is going to be stable across frames.

Our camera location minus this offset can be done on the CPU side in double precision, so the results of this will be both small (in magnitude) and high precision.

So now we have two small locations multiplied by the camera rotation that have to cancel out - this is what we would have had if our engine used only small meshes.

In other words, by applying a rounded, infrequently  changing static offset first, we can reduce the problem to what we would have had in a small-world engine, "just in time".

You might wonder what happens if the mesh vertex is no-where near our offset - my claim that the result will be really small is wrong. But that's okay - since the offset is near the camera, mesh vertices that don't cancel well are far from the camera and too small/far away to jitter. Jitter is a problem for close stuff.

The CPU-side math goes like this: given an affine model-view matrix in the form of R, T (where R is the 3x3 rotation and T is the translation vector), we do this:

// Calculate C, the camera's position, by reverse-
// rotating the translation
C = transpose(R) * T
// Grid-snap the camera position in world coordinates - I used 
// a 4 km grid. smaller grids mean more frequent jumps but 
// better precision.
C_snap = grid_round(C)
// Offset the matrix's translation by this snap (moved back 
// to post-rotation coordinates), to compensate for the pre-offset.
T -= R * C_snap
// Pre-offset is the opposite of the snap.
O = -C_snap

In our shader code, we transform like this:

v_eye = (v_world - O) * modelview_matrix

There's no reason why the sign has to be this way - O could have been C_snap and we could have added in the shader; I found it was easier to debug having the offset be actual locations in the world.

Fixing Hardware Instancing

There's one more case to fix. If your engine has hardware instancing, you may have code that takes the (small) model mesh vertices and applies an instancing transform first, then the main transform. In this case, the large vertex is the result of the instancing matrix, not the mesh itself.

This case is easily solved - we simply subtract our camera offset from the translation part of the hardware instance. This ensures that the instance, when transformed into our world, will be near the camera - no jitter.

One last note: on some drivers I found the driver was very finicky about order of operations - if the calculation is not done by applying the offset before the transform, the de-jitter totally fails. The precise and invariant qualifiers didn't seem to help, only getting the code "just right" did.