Comments on The Hacks of Life: Instancing for BrickSmith (9 comments)

Benjamin Supnik (2013-03-09 09:30):

This may sound crazy, but we don't have reliable access to hardware-accelerated geometry shaders on OS X for our supported set of operating systems.

My general view on occlusion queries is this: I have never seen an occlusion-query system simpler to write than an LOD system. And there's a lot to be said for LOD under all conditions - consider that a single baseplate can generate 180,224 vertices right now. (The odds of the entire baseplate being occluded are poor, and the rendering quality of that many vertices in a small space looks lousy.)

So my plan of action is to ignore occlusion culling temporarily and do something like this:

1. Implement sane basic LOD.
2. Observe the new performance characteristics.
3. Implement the DUMBEST occlusion system I can, only to get metrics on the 'real' savings with real Lego models.
4. Only then, examine various occlusion-culling schemes, if we find from (2) that we need more vertex savings and from (3) that occlusion savings will deliver.

BrickSmith is an open-source 'hobby program' to all of its authors, so there's no way we can invest the time to do sane and proper occlusion culling without first determining that the payoff is there.
Heck - I'd do the same pre-research for my day-job work too!

Mipmap (2013-03-06 14:45):

Given the heavy instancing of Lego models, maybe this would be even more beneficial than occlusion culling:
http://rastergrid.com/blog/2010/02/instance-culling-using-geometry-shaders/

Mipmap (2013-03-06 13:23):

Have you considered fully GPU-based occlusion? Roughly:

1. Render the parts as bounding boxes to a mip-mapped depth texture.
2. Render the parts as bounding boxes again, this time querying the depth texture for occlusion.
3. If a part is potentially visible, a geometry shader renders the actual part; otherwise the geometry shader discards it.

I'm relaying this idea from http://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/, which includes a working sample.

Benjamin Supnik (2013-02-22 20:24):

Mipmap, you're right about the massive over-draw; I'd have to read the fine print of the GL spec to see how much actual overdraw is happening (because most of those triangles are going to end up sub-pixel in size). Either way, it's terribly wasteful.
:-)

I think the problem with occlusion culling is that we'd need additional data of some kind to do the occlusion, and no matter which way you slice it, that data can't come from the existing LDraw data.

Is there an occlusion-culling data structure you like that would be amenable to a huge number of small parts (which together may or may not cover useful areas) when offline pre-processing isn't an option?

Mipmap (2013-02-22 17:38):

Regarding your 39,000 bricks / 125 million vertices @ 5 fps case: that's around 90 vertices *per pixel* on a 2560x1440 monitor - in other words, massive overdraw.

Besides LOD techniques such as those mentioned above to mitigate this, how about occlusion culling? Lego models should in general be very amenable to it.

Benjamin Supnik (2013-02-04 09:51):

Alex, you are totally correct that BrickSmith is using the GPU asymmetrically - in particular, when a brick is far away (and thus rendered at low resolution), a shader-based procedural set of studs would be better than geometry.

BrickSmith is an editor for LDraw files, and the LDraw format is pre-specified, so substituting procedural definitions for chunks of geometry, while possible, isn't really in the application's design domain.

alex/bluespoon (2013-02-04 08:45):

Cool post!
I'm not particularly familiar with BrickSmith, but I am with recent GPUs; they have a lot of fragment-shader power you're not using. My first thought while reading your post, in trying to reduce your vertex-bound bottleneck, would be to render the Lego 'bobbles' (the studs) with a special bobble shader: draw simple bounding geometry (either a world-space cuboid or a screen-space bounding rectangle, i.e. either 8 or 4 vertices) and use a fragment shader that 'raytraces' against the analytic equation of a capped cylinder. That would also give you perfectly round bobbles and balance the use of your GPU better - far fewer vertices. I know there's a pool/snooker game that does this for its spheres, for example (I can't remember the reference at the moment, but Google will probably find it). The maths for ray-cylinder and ray-plane (for the caps) intersection is really simple.

You could of course fall back to your existing vertex-based bobble geometry for cards that do not support fancy fragment shaders.

Benjamin Supnik (2013-01-31 10:44):

Closed: all of your techniques are totally valid for boosting instancing speed, but in the case of BrickSmith, performance data I have already gathered indicates they won't help, because of the shape of BrickSmith's source data. Here are some of the things I measured. Note that this applies to ATI Mac drivers only; BrickSmith is Mac-only software, and my experience with the NV Mac driver is that it isn't competitive performance-wise. :-(

(BrickSmith is also not yet ported to the 3.2 core profile on Mac, so I have not been able to test TBOs on NV Mac hardware.)

Would smaller instance data help? I don't think it would in this case, because we're not bandwidth-limited in pushing the relatively small number of instance objects.
I do believe that if we had a larger number of simpler instances this _would_ start to matter.

Would a shorter vertex shader help? I don't think so; I tried simplifying the vertex-shader and fragment-shader calculations and the needle didn't move.

Would changing the index sourcing from attribute divisors to TBOs help? Would having a smaller number of bigger instances help? I don't think so. I have seen a multi-mode test program that can vary the instance count, the instance mesh size, and the instance source (vertex divisor, UBO, or TBO), and max vertex throughput is flat across everything - more instances trade off perfectly with smaller instances (up to a point), and you can have a surprisingly small instance mesh (e.g. 100 vertices). The bottleneck really is in triangle setup.

One other note on mesh size: a single Lego stud in BrickSmith is 56 vertices - at that rate, virtually every "brick" is going to be at least 100 vertices, that is, big enough to be efficient from an instance-stream perspective.

Fortunately, there just aren't _that_ many kinds of bricks in the Lego world - the test model I used has 39,690 bricks but only 548 unique bricks. That's just not a huge number of instance batches. And if we're drawing 125M instanced vertices in only 548 batches, we're well into the efficient range of the GPU - no merging necessary. :-)

The merged-instancing idea seems interesting but unnecessary - I have not seen any indication that the hardware can't re-run the mesh buffer (without duplication) efficiently.
(On Windows NV hardware, attribute divisors appear to perform acceptably in X-Plane, which is what makes me think the NV Mac performance is driver-related and not a hardware limitation; this was on the new 650M laptops.)

Closed (2013-01-31 02:06):

Sorry, but this implementation is very slow:
1) Reduce per-instance data: it is possible to pack all (and more of) your data into 8 floats and then unpack it in the shader using the GLSL unpack* functions (do not store matrices directly; separate them into a position and a local XY rotation).

2) Do not use glVertexAttribDivisor (see 3).

3) Use merged instancing: duplicate your model 100-1000 times (just make the vertex buffer larger), then use complex indexing in the vertex shader, like:

    int i = INSTANCE_OFFSET + gl_InstanceID*INSTANCE_BATCH_SIZE + gl_VertexID/ELEMENT_VERTEX_COUNT;
    i *= PIXELS_PER_INSTANCE;

then fetch and unpack the data:

    vec3 inGlobalPosition = texelFetch(instanceBuffer_unit7, i).xyz;
    vec3 instanceAttrib = texelFetch(instanceBuffer_unit7, i+1).xyz;
    uvec3 instanceAttribI = floatBitsToUint(instanceAttrib);
    vec2 inAzimuthSinCos = unpackHalf2x16(instanceAttribI.x);
    vec4 inInstanceAttributes = unpackUnorm4x8(instanceAttribI.y);

Yes, this way is complex and requires a lot of instructions, BUT it is very fast - like the speed of light.

The main problem in your engine is a very low VS rate, because your main VBO is very small.

P.S. Sorry if I misunderstood your post (: