Friday, December 21, 2007

What's Wrong With This Code

What's wrong with this code, which uses a second thread to build new nodes in a scene graph while the main thread renders and periodically picks up finished work?
thread 2:
    create mesh geometry in memory
    glBufferData to create a VBO
    send a message to thread 1 with the VBO ID

thread 1:
    while(1)
    {
        if we got a message
            insert the VBO into the scene graph
        draw the scene graph
    }

Or how about this logic, which uses threads and PBOs to asynchronously capture full screen to disk?

thread 1:
    render frame
    bind the PBO
    glReadPixels
    unbind the PBO
    send a message to thread 2

thread 2:
    wait for a message
    bind the PBO
    get the PBO data
    unbind the PBO
    write the data to disk


Hint: the answer is the same!

Updated OpenGL Performance Lore

Performance Lore (e.g. "I did X on my app and it got faster so you should do Y") is inherently dangerous: performance tuning without performance analysis is a bad idea. With that in mind, I think it does have a use in at least pointing out possible areas of exploration.

So, with those caveats, some notes on X-Plane 9 performance:
  • It doesn't take a very complex pixel shader to become fill-rate bound! In version 8 (where our shaders were almost identical to the fixed-function pipeline) X-Plane was almost always limited by the CPU and our ability to find and emit batches, even at high FSAA. With X-Plane 9, even on high-end cards, even modest-sized shaders are the limiting factor. Never underestimate the cost of doing something ten or twenty million times!
  • Overdraw is bad! As soon as you get into shaders you become fill-rate bound, and then overdraw becomes very expensive. So using blending and glPolygonOffset to composite into the frame buffer becomes a bad idea.
  • Geometry performance: putting index buffers into VBOs gained us some speed.
  • Optimizing down the number of vertices by more aggressively using indexing gained some speed, and never hurt, even if the mesh was a bit out of order, even on older hardware.
  • Using 16-bit indices on Mac made no difference - perhaps they're not natively supported?
  • Using the "static" VBO hint instead of "stream" for non-repeating geometry gave a tiny framerate improvement in a few cases, but mostly caused massive performance failures.
A note on this last point: the driver essentially has to guess a resource usage pattern from your API calls. A lot of them use a most-recently-used (MRU) purge scheme; that is, once you run out of VRAM, you throw out the most recently used thing and replace it. If the driver used a real FIFO instead, we'd be guaranteed the worst case: flushing and reloading every resource into VRAM every frame. So MRU makes sense.

The stream vs. static VBO hint will usually get your VBO into VRAM (static) or simply use it by mapping via AGP memory (stream). If your total working set is larger than VRAM (something's going to go over the bus) and a VBO is used only once, stream is the best choice; save VRAM for things that are used repeatedly, like repeating meshes and textures.

So when I tried jamming everything into VRAM, we ran out of VRAM in a big way and started to thrash... MRU works well when we're over budget by a tiny amount, but with a huge shortfall we just end up thrashing.

None of this is really surprising; people have been telling us we'd be shader-bound for years...it just took us a while to get there.

Thursday, December 20, 2007

VTune - where's the stack?

I used VTune the other day to look at some performance problems on my PC. (Turns out the performance problem is the PC.) There's one thing about VTune I don't quite get:
  • When you're using the adaptive sampling profiler (system-timer-driven sampling) you only get function names.
  • To get a stack crawl you need to instrument the application (invasive profiling).
Now I don't like invasive profiling at all for a few reasons...
  • It changes the performance characteristics of the application.
  • It's not really compatible with inlining.
When sampling with Shark on the Mac, you get the stack history of each sample, not just the function itself. This is pretty important when leaf functions are bottlenecks from certain directions. For example, it's one thing to know you're spending all your time in glDrawElements (duh). But who is calling it? The OBJ engine? The mesh engine? Something else?

Friday, September 07, 2007

How does COM know?

If you call a non-blocking method on a COM object that's in a different apartment and you need to receive callbacks, you need to sit in a message pump of some kind. That fact is written in a few different blogs, but what I didn't understand until today is: why does that work? How is COM connected to the message pump?

First let's pull apart that statement with some basics:
  • Single-Threaded Apartment (STA) basically puts boundaries around sets of objects that are similar to the boundaries around a process. When calling between the apartments (sets of COM objects) we use an RPC mechanism instead of a direct function call.
  • There is a 1:1 correspondence between threads and apartments, and a 1:1 correspondence between threads and their message queues. Thus there is an appropriate message queue for each apartment, and posting to that queue determines which thread gets the message.
  • RPCs are thus implemented by posting messages to the queue of the thread we're trying to reach. We then poll our own message queue until we get some kind of "reply" indicating the RPC is done. This looks like a blocking function call to client code.
In order for this to work, the thread in the apartment we are calling into must itself be waiting on a message queue. This would be the case if it is either (1) really bored and just querying the message queue or (2) it is itself blocked on a method call into another apartment, and is thus polling its queue to find out if its own RPC is done.

If this all seems like insanity to you, well, it is.

Now when I say "non-blocking" method call, what I really mean is: a method call that returns really fast but starts some work to be completed later.

Normally when a thread is blocked because it made an RPC into another apartment, that apartment can call right back: the same message-queue polling that discovers the RPC is over also dispatches incoming method calls. This simply means that the flow of code between COM objects can ignore STA as long as all method calls are "blocking".

But as soon as we have a non-blocking call, there is no guarantee that the client code is actually listening for method calls into its apartment. (By the rules of STA, if the thread is busy doing stuff, no calls CAN happen, because there is only one thread per apartment.)

Typically client code will make the async call, maybe make a few, and then do some kind of blocking until we're done... for example, we might call WaitForMultipleObjects.

In this case the right thing to do is MsgWaitForMultipleObjects (followed by GetMessage/DispatchMessage if we get woken up for a message). This way while our thread is doing nothing, other apartments can call us back.

This works because the thread, message queue, and apartment are all 1:1 relationships. So to say "this thread needs to be open to COM RPCs" all we need to say is "this thread needs to block on its own message queue", which is done with GetMessage.
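Sketched as a Win32 wait loop (Windows-only; the function and event names are illustrative, not from any real codebase):

```cpp
// Windows-only sketch: wait for our own completion event, but keep
// pumping the message queue so COM can deliver cross-apartment calls.
#include <windows.h>

void WaitButStayCallable(HANDLE hWorkDone)
{
    for (;;)
    {
        DWORD r = MsgWaitForMultipleObjects(
            1, &hWorkDone, FALSE, INFINITE, QS_ALLINPUT);
        if (r == WAIT_OBJECT_0)
            return;                 // our async work finished
        // r == WAIT_OBJECT_0 + 1: a message arrived. Dispatch everything
        // queued so COM's hidden RPC window can service incoming calls.
        MSG msg;
        while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE))
        {
            TranslateMessage(&msg);
            DispatchMessage(&msg);
        }
    }
}
```

MsgWaitForMultipleObjects wakes us for either the event or a message; dispatching the messages is what lets other apartments call us back while we "block".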

Friday, July 20, 2007

gl_FragCoord + ATI = World of Hurt

Well, we just had a bit of a fire drill when we learned that gl_FragCoord causes some ATI cards on Windows to revert to software shading. I don't know if this is fixed in a driver upgrade - we used a simple workaround suggested by a user on gamedev.net:
  • We store the clip-space coordinate in a varying variable. (We have to compute that anyway; it's just the result of ftransform().)
  • We then perform a perspective divide and change the coordinate range in the fragment shader.
Fortunately we wanted the screen coordinate as 0..1, so having clip coordinates is almost what we want anyway. The perspective divide must be per-pixel - varying variables only interpolate in a perspective-correct manner in homogeneous clip coordinates, not window coordinates. At least, I think. :-)
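A sketch of the workaround in GLSL (hypothetical shaders; the variable names and the final use of the coordinate are illustrative):

```glsl
// vertex shader
varying vec4 clip_pos;              // clip-space position, before the divide
void main()
{
    gl_Position = ftransform();
    clip_pos    = gl_Position;      // we computed it anyway - just pass it on
}

// fragment shader
varying vec4 clip_pos;
void main()
{
    // Per-pixel perspective divide, then remap NDC -1..1 to 0..1.
    vec2 screen01 = (clip_pos.xy / clip_pos.w) * 0.5 + 0.5;
    gl_FragColor = vec4(screen01, 0.0, 1.0);    // placeholder use
}
```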

This is a case where "the Google" comes in really handy -- having confirmation from other users of a known bug gives us a lot more confidence in the workaround.

Friday, June 15, 2007

list.size() is slow

Ever wonder how the STL list works in GCC?

What? No? You haven't?

Where's your sense of curiosity about the way your code works?

What? You have a life?!? What are you reading this blog for then?

Anyway, it's pretty simple. It's a doubly-linked list, where the node has a previous and next pointer. The head and tail of the list are linked together, and the list itself keeps a pointer to the head node.

This means that begin() and end() are fast.

Unfortunately, the list doesn't keep a size counter. This means that size() is O(n). So you can find the back of the list easily, but finding its index is expensive.

That's probably fine, as turning the index back into an iterator is O(n) and the iterators are stable. By comparison though, CodeWarrior's list caches the number of items.

Moral of the story: empty() is more efficient than size()==0, and, um, know what's in your STL.

(Insert rant about STL being like sausage here...)

Thursday, June 14, 2007

Virtual Memory Dumping for Fun and Profit

We're trying to answer the question: why did X-Plane run out of memory? To answer that, we need to look at the contents of virtual memory. Now dumping virtual memory isn't so bad.

On Windows you call VirtualQuery with a base address. Start at 0 and increment by the returned RegionSize. When you get to dwTotalVirtual (as returned by GlobalMemoryStatus), that's a good time to stop.

On the Mac, call vm_region, starting at 0 and incrementing by "size" until it returns an error code. One tricky thing: vm_region will skip over unused virtual memory. To "see" these holes (this is address space that can be used later), compare the address you pass in to the one that's returned. If you pass in a pointer that's in an unused region, the address will be advanced to the start of the next region.

What if you want to show the DLLs/dylibs that are mapped into a given region? I make no promise for the performance of this method but...

On Windows, use GetMappedFileName. Pass in your process handle and a base address from VirtualQuery and it fills in a buffer with a DLL name if possible. This is in psapi.dll so it isn't available on non-NT-derived Windows.
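The Windows side of both steps might look something like this (a Windows-only sketch; error handling is minimal and the output formatting is illustrative):

```cpp
// Windows-only sketch: walk the address space, dumping each region and,
// where possible, the file (DLL/EXE) mapped there. Link with psapi.lib.
#include <windows.h>
#include <psapi.h>
#include <stdio.h>

void DumpAddressSpace(void)
{
    MEMORYSTATUS ms;
    GlobalMemoryStatus(&ms);

    char* p = 0;
    while (p < (char*)ms.dwTotalVirtual)
    {
        MEMORY_BASIC_INFORMATION info;
        if (!VirtualQuery(p, &info, sizeof(info)))
            break;

        char name[MAX_PATH] = "";
        GetMappedFileNameA(GetCurrentProcess(), p, name, sizeof(name));

        printf("%p  %8lu KB  state=%08lx  %s\n",
               info.BaseAddress,
               (unsigned long)(info.RegionSize / 1024),
               (unsigned long)info.State,
               name);
        p += info.RegionSize;   // step to the next region
    }
}
```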

On Mac, use dladdr, passing in the base address of the region. You'll get a dylib file path and a symbol name, although the symbol name tends to be junk.