## Monday, September 08, 2008

### Geometry Shader Performance on the 8800

So this is what I learned today: when it comes to geometry shaders and the 8800, triangle strips matter. Now after you read the details, this will seem so obvious that you can only conclude that I am a complete doofus (something I will not necessarily dispute). But the 8800 (like most modern cards) is so bloody fast that triangle strips are not a win in almost any other configuration.

The test: a mesh of 1000 x 1000 quads (each in turn is two triangles), being rotated. Using a single static vertex buffer with static indexes, this runs at around 50-55 fps. Each vertex has 8 components (XYZ, normal, texture ST).

Now some numbers:
• The baseline is around 54 fps.
• Cutting the geometry to a 500x500 mesh brings us to around 204 fps, close to the 4x we expect from a vertex-bound operation with a quarter of the vertices. The pixel shading has been kept intentionally simple to achieve this result.
• Using a geometry shader which simply passes through the geometry has no effect on fps.
• Cutting the mesh to 500x500 and using a geometry shader that splits one triangle into four by emitting 12 vertices and ending 4 primitives (i.e. four separate triangles) runs at a creeping 25 fps.
• Cutting the mesh to 500x500 and using a geometry shader that splits one triangle into four by emitting 8 vertices and ending 2 primitives (i.e. two strips) runs at 68 fps.
• When using this strip-based geometry shader, sorting the mesh indices into strip order (e.g. 0 1 2 2 1 3 2 3 4 4 3 5) improves fps to 73 or so. When not using the geometry shader, this strip sorting has no impact.
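
The strip-sorted index order in the last bullet is what you get by expanding a strip into an indexed triangle list, swapping the first two indices of every odd triangle so the winding stays consistent. A minimal Python sketch of that expansion (the function name is made up for illustration, and this is not the code used for the test):

```python
def strip_to_triangle_list(strip):
    """Expand a triangle strip into a GL_TRIANGLES-style index list.

    Odd-numbered triangles get their first two indices swapped so that
    every triangle keeps the same winding as the first one.
    """
    indices = []
    for i in range(len(strip) - 2):
        a, b, c = strip[i], strip[i + 1], strip[i + 2]
        if i % 2 == 1:
            a, b = b, a
        indices.extend((a, b, c))
    return indices


print(strip_to_triangle_list([0, 1, 2, 3, 4, 5]))
# [0, 1, 2, 2, 1, 3, 2, 3, 4, 4, 3, 5]
```

The mesh is still drawn as an indexed triangle list; only the order of the triangles in the index buffer changes.
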

Let's tease that mess apart and see what it means. Basically my goal was to test the performance of "dynamically created" geometry (e.g. creating more vertices from fewer using a geometry shader) vs. "mesh updating" (e.g. periodically re-tessellating the mesh and saving the results to new VBOs). The latter technique's best performance is simulated by the 1000x1000 VBO in VRAM; the former by the geometry shader.

As you can see, geometry shaders can outperform straight VBO drawing, but only if they are set up carefully. In particular, a geometry shader can't output a list of separate triangles as one primitive (the only triangle output type is a strip), so if we want to draw distinct triangles, we have to end a primitive after every three vertices. There is also no vertex indexing out the back of a geometry shader, so every shared vertex has to be emitted again, and strips are a win.
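
To make the 12-vertex and 8-vertex counts concrete, here is a rough Python model of the two emit orders for a midpoint split. It is only a sketch: the 5 + 3 strip layout is one arrangement that happens to work, not necessarily the one the shader in this test used, and all function names are invented for illustration. Each inner list stands for one output primitive, i.e. a run of `EmitVertex()` calls closed by one `EndPrimitive()`.

```python
def midpoint(p, q):
    return tuple((a + b) / 2 for a, b in zip(p, q))


def split_as_triangles(A, B, C):
    """Four 3-vertex primitives: 12 vertices emitted, 4 primitives ended."""
    ab, bc, ca = midpoint(A, B), midpoint(B, C), midpoint(C, A)
    return [[A, ab, ca], [ab, B, bc], [ca, bc, C], [ab, bc, ca]]


def split_as_strips(A, B, C):
    """Two strips: 8 vertices emitted, 2 primitives ended.

    The 5-vertex strip covers corner A, the center and corner C;
    the 3-vertex strip covers corner B.
    """
    ab, bc, ca = midpoint(A, B), midpoint(B, C), midpoint(C, A)
    return [[A, ab, ca, bc, C], [ab, B, bc]]


def strip_triangles(strip):
    """Triangles a strip produces, with odd ones flipped to keep winding."""
    tris = []
    for i in range(len(strip) - 2):
        a, b, c = strip[i], strip[i + 1], strip[i + 2]
        if i % 2 == 1:
            a, b = b, a
        tris.append((a, b, c))
    return tris


def canonical(tri):
    """Rotate a triangle so its smallest vertex is first (winding unchanged)."""
    tri = tuple(tri)
    k = tri.index(min(tri))
    return tri[k:] + tri[:k]


A, B, C = (0.0, 0.0), (2.0, 0.0), (0.0, 2.0)
separate = split_as_triangles(A, B, C)
strips = split_as_strips(A, B, C)

# Both layouts cover the same four triangles with the same winding.
same = {canonical(t) for t in separate} == {
    canonical(t) for s in strips for t in strip_triangles(s)
}
print(sum(len(p) for p in separate), len(separate))  # 12 4
print(sum(len(p) for p in strips), len(strips))      # 8 2
print(same)                                          # True
```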

(Contrast this to drawing out of a VBO - with indexing and multiple triangles per call, and a huge cost to restarting primitives, GL_TRIANGLES indexed is usually faster than strips.)

What's surprising here is not that strips are faster in the geometry shader, but that they are so much faster! With strips we've cut down the emitted geometry by about a third (from 12 vertices to 8), but we get an almost 3x improvement in throughput (25 fps to 68 fps). My theory is that emitting fewer primitives is what wins; we've cut down geometry and cut the number of primitives in half.
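
For a sense of scale, the fps figures can be turned into per-second rates. This is back-of-the-envelope arithmetic in Python using only the rounded numbers quoted above, assuming the 500x500 mesh is exactly 250,000 quads of two triangles each; the variable names are made up for illustration.

```python
FPS_BASELINE = 54      # 1000x1000 mesh, no geometry shader
FPS_GS_TRIANGLES = 25  # 500x500 mesh, 12 vertices / 4 primitives per input triangle
FPS_GS_STRIPS = 68     # 500x500 mesh, 8 vertices / 2 primitives per input triangle

baseline_tris = 1000 * 1000 * 2  # triangles per frame from the big VBO
input_tris = 500 * 500 * 2       # triangles per frame fed to the geometry shader
output_tris = input_tris * 4     # triangles per frame after the 1-to-4 split

# The split 500x500 mesh produces as many triangles as the 1000x1000 mesh.
assert output_tris == baseline_tris

rates = {
    "baseline triangles/s": baseline_tris * FPS_BASELINE,
    "GS strips triangles/s": output_tris * FPS_GS_STRIPS,
    "GS separate: vertices emitted/s": input_tris * 12 * FPS_GS_TRIANGLES,
    "GS strips: vertices emitted/s": input_tris * 8 * FPS_GS_STRIPS,
    "GS separate: primitives ended/s": input_tris * 4 * FPS_GS_TRIANGLES,
    "GS strips: primitives ended/s": input_tris * 2 * FPS_GS_STRIPS,
}
for name, value in rates.items():
    print(f"{name}: {value / 1e6:.0f} million")

speedup = FPS_GS_STRIPS / FPS_GS_TRIANGLES
vertex_saving = 1 - 8 / 12
print(f"speedup {speedup:.2f}x, vertex saving {vertex_saving:.0%}")
```

Since these come from single rounded fps readings, treat them as rough magnitudes only.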

The moral of the story is: it pays to find a way to strip-ify the output of geometry shaders.

1. Hi, I've found this site thru google, because I've noticed a problem when using GS and your article denies it...

When I'm using a fragment shader and a vertex shader (the simplest possible: VS for vertex transformation and FS for drawing a white pixel) I get 320 FPS for my heightmap. When I attach the simplest geometry shader I end up with 145 FPS. Can you tell me how "Using a geometry shader which simply passes through the geometry has no effect on fps." can be true? Could the problem be that I have an ATI card (a driver problem) or that I use GLSL? Thanks in advance for your reply.

PS: When I want to send normal vector from VS to FS (and unavoidably through GS) I get 124 FPS :/

Regards.

2. If you are using an ATI card I would expect _no_ consistency with the numbers I posted. You'll need to characterize GS performance for yourself on the target drivers you care about supporting in your shipping product. This is the way of OpenGL: it will work everywhere, but it's not guaranteed to be fast everywhere.

My overall conclusion has been that GS simply aren't a terribly fast path across a wide range of hardware. But that's based on the perf my product needs on the cards and platforms my company cares about.

3. Ok, thanks. I just had many problems regarding to some details with shaders, so I thought that it's my local problem.