The Hacks of Life: 12/01/2007

Friday, December 21, 2007

What's Wrong With This Code

What's wrong with this code, which uses a second thread to build new nodes in a scene graph while the main thread renders and periodically picks up finished work?

thread 2:
 create mesh geo in memory
 glBufferData to create a VBO
 send msg to thread 1 with VBO ID

thread 1:
 while(1)
 {
    if we get a msg
       insert VBO into scene graph
    draw scene graph
 }

Or how about this logic, which uses threads and PBOs to asynchronously capture full screen to disk?

thread 1:
 render frame
 bind pbo
 glReadPixels
 unbind pbo
 send msg to thread 2

thread 2:
 wait for message
 bind pbo
 get pbo data
 unbind pbo
 write data to disk

Hint: the answer is the same!

Updated OpenGL Performance Lore

Performance lore (e.g. "I did X on my app and it got faster so you should do Y") is inherently dangerous, because performance tuning without performance analysis is a bad idea. That said, I think it does have a use: it at least points out possible areas to explore.

With that in mind, some notes on X-Plane 9 performance:

It doesn't take a very complex pixel shader to become fill-rate bound! In version 8, where the shaders did almost exactly what the fixed-function pipeline does, X-Plane was almost always limited by the CPU and our ability to find and emit batches, even at high FSAA. With X-Plane 9, even on high-end cards, even modestly sized shaders are the limiting factor. Never underestimate the cost of doing something ten or twenty million times!
Overdraw is bad! As soon as you get into shaders, you become fill-rate bound, and then overdraw becomes very expensive. So using blending and glPolygonOffset to composite into the frame buffer becomes a bad idea.
Geometry performance: putting index buffers into VBOs gained us some speed.
Optimizing down the number of vertices by more aggressively using indexing gained some speed, and never hurt, even if the mesh was a bit out of order, even on older hardware.
Using 16-bit indices on Mac made no difference - perhaps they're not natively supported?
Using the "static" VBO hint instead of "stream" for non-repeating geometry gave a tiny framerate improvement in a few cases, but mostly caused massive performance failures.

A note on this last point: the driver essentially has to guess a resource usage pattern from your API calls. A lot of drivers use a most-recently-used (MRU) purge scheme: once you run out of VRAM, the driver throws out the thing you used most recently and replaces it. With a real FIFO, a per-frame working set that doesn't fit would guarantee the worst case: the driver would flush and reload every resource into VRAM every frame. So MRU makes sense.

The stream vs. static VBO hint will usually either get your VBO into VRAM (static) or leave it in AGP-mapped system memory for the GPU to read over the bus (stream). If your total working set is larger than VRAM (so something is going to go over the bus anyway) and a VBO is used only once, stream is the best choice; save VRAM for things that are used repeatedly, like repeating meshes and textures.

So when I tried jamming everything into VRAM, we ran out of VRAM in a big way and started to thrash. MRU works well when we're over budget by a tiny amount, but with a huge shortfall we just end up with thrash.
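To make the purge-policy argument concrete, here is a minimal Python sketch. It is not driver code: the `simulate` function, its parameters, and the resource counts are all assumptions made for illustration. It models a frame as touching every resource once, in the same order, and counts how many touches miss a fixed-size "VRAM" under FIFO and MRU purging.

```python
from collections import OrderedDict


def simulate(policy, capacity, working_set, frames):
    """Count misses when `working_set` resources are each touched once per
    frame, in the same order, against a cache holding `capacity` resources.

    Hypothetical model: policy is "fifo" (purge the oldest resident) or
    "mru" (purge the most recently used). capacity must be at least 1.
    """
    if policy not in ("fifo", "mru"):
        raise ValueError("policy must be 'fifo' or 'mru'")
    if capacity < 1:
        raise ValueError("capacity must be at least 1")

    # For "fifo" the order is insertion order; for "mru" hits move a
    # resource to the end, so the last key is the most recently used.
    cache = OrderedDict()
    misses = 0
    for _ in range(frames):
        for resource in range(working_set):
            if resource in cache:
                if policy == "mru":
                    cache.move_to_end(resource)
                continue
            misses += 1
            if len(cache) >= capacity:
                # last=True pops the newest entry (MRU purge),
                # last=False pops the oldest entry (FIFO purge).
                cache.popitem(last=(policy == "mru"))
            cache[resource] = True
    return misses


if __name__ == "__main__":
    for working_set in (5, 20):
        for policy in ("fifo", "mru"):
            total = working_set * 10
            missed = simulate(policy, capacity=4, working_set=working_set, frames=10)
            print(f"{policy}: {working_set} resources, 4 slots: {missed}/{total} misses")
```

In this model, FIFO misses on every single touch as soon as the working set exceeds capacity, while MRU keeps part of the set resident. No policy can hit more than `capacity` resources per frame, though, so when the shortfall is huge most touches miss either way.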

None of this is really surprising; people have been telling us we'd be shader-bound for years...it just took us a while to get there.

Thursday, December 20, 2007

VTune - where's the stack?

I used VTune the other day to look at some performance problems on my PC. (Turns out the performance problem is the PC.) There's one thing about VTune I don't quite get:

When you're using the adaptive sampling profiler (system-timer-driven sampling) you only get function names.
To get a stack crawl you need to instrument the application (invasive profiling).

Now, I don't like invasive profiling at all, for a couple of reasons:

It changes the performance characteristics of the application.
It's not really compatible with inlining.

When sampling with Shark on the Mac, you get the call stack of each sample, not just the function itself. This is pretty important when a leaf function is a bottleneck only when it is reached from certain callers. For example, it's one thing to know you're spending all your time in glDrawElements (duh). But who is calling it? The OBJ engine? The mesh engine? Something else?
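As a rough illustration of the difference, the Python sketch below aggregates the same set of samples two ways. The sample data and the function names in it (`obj_engine_draw`, `mesh_engine_draw`, `update_physics`) are made up for the example; this is not output from Shark or VTune.

```python
from collections import Counter

# Hypothetical profiler samples: each one is a call stack, outermost
# frame first, leaf function last.
SAMPLES = [
    ("main", "draw_scene", "obj_engine_draw", "glDrawElements"),
    ("main", "draw_scene", "obj_engine_draw", "glDrawElements"),
    ("main", "draw_scene", "obj_engine_draw", "glDrawElements"),
    ("main", "draw_scene", "mesh_engine_draw", "glDrawElements"),
    ("main", "update_physics"),
]


def by_leaf(samples):
    """What a function-only profile shows: sample counts per leaf function."""
    return Counter(stack[-1] for stack in samples)


def by_caller_of(samples, leaf):
    """What a stack profile adds: who called `leaf` in each sample."""
    return Counter(
        stack[-2]
        for stack in samples
        if len(stack) >= 2 and stack[-1] == leaf
    )


if __name__ == "__main__":
    print(by_leaf(SAMPLES))
    print(by_caller_of(SAMPLES, "glDrawElements"))
```

The leaf-only view says only that most samples land in glDrawElements; the per-caller view splits those same samples between the two made-up engines, which is the information you need to decide where to optimize.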