Friday, December 22, 2006

Garbage Collection - Memory Management by Negligence

I realize that calling garbage collection "negligent" memory management isn't really fair. But
I've heard enough people argue that garbage collection is the cure for the disease that is C++ memory management bugs, e.g. arguments like these.

The classic C++ response to "new/delete makes bugs" is "manual memory management is fast." I'm not sure I would agree with this. I would say that C++ gives a programmer the flexibility to pick a memory management strategy and tune it to be fast for a given application. But I will argue three other reasons why I would rather have explicit than garbage collected memory management:
  1. Reproducible deallocation paths. When we get a bug in X-Plane where memory has been trashed, an object has been freed, or some other memory-related bug, the most important thing for us is that the bug be reproducible. If the sim employed generalized garbage collection, then a whole category of unrelated behavior could influence when objects are allocated and destroyed. I would even argue that garbage collection breaks encapsulation by allowing the behavior of objects to be influenced by unrelated subsystems in surprising ways (since they are all linked through the garbage collector).
  2. Explicit description of memory allocation. One thing I like about X-Plane is that I can see where we deallocate memory. Each deallocation is programmed*. If I find a buggy deallocation, I can trace it back to an intended deallocate and ask "what did I mean by this?"
  3. Explicit memory management means programmers thinking about memory management. What I would argue is that you can make all the same kinds of mistakes in a garbage-collected system as you can in an explicit system, e.g. by making circular loops of objects, etc. But no one ever said "when you use new/dispose, just relax and don't think about memory - it'll just work".
*Not necessarily programmed by new/dispose - I am all in favor of building up abstractions around memory management - sometimes even garbage collection.

Okay now I've contradicted myself. I suppose a more fair statement would be that memory management strategies have implications. A programmer should pick a strategy for a given problem and realize that it's a design decision with trade-offs. Picking garbage collection has good and bad things about it, but like most design patterns, it is not appropriate for all code (and I would even say it's not appropriate for all OO code) and while it makes some things easier, it makes other things harder.
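For what it's worth, the simplest such abstraction is scope-based ownership. A minimal sketch (the class is a stripped-down stand-in for whatever smart pointer you prefer; X-Plane's real abstractions are surely richer):

```cpp
// A minimal RAII owner: deallocation is still explicit in the sense that it
// happens at a known, readable point (the end of the owning scope), but no
// one has to remember to call delete on every exit path.
template <typename T>
class scoped_ptr {
public:
	explicit scoped_ptr(T * p = 0) : ptr_(p) {}
	~scoped_ptr() { delete ptr_; }            // reproducible deallocation path
	T * get() const { return ptr_; }
	T * operator->() const { return ptr_; }
private:
	scoped_ptr(const scoped_ptr&);            // non-copyable: exactly one owner
	scoped_ptr& operator=(const scoped_ptr&);
	T * ptr_;
};

struct mesh { int tris; };

int tri_count()
{
	scoped_ptr<mesh> m(new mesh());
	m->tris = 42;
	return m->tris;
}	// m's mesh is deleted right here, every time, on every path
```

The deallocation is still "programmed" - you can point at the closing brace and say "that's where it dies" - which is exactly the property a collector takes away.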

Thursday, December 14, 2006

Instrumentation

nVidia has a very cool tool called NVPerfHUD - it's an application that provides on-screen diagnostics and debugging for graphics-intensive applications. Unfortunately for us it has two problems:
  1. It's Windows only and we do 99% of X-Plane development on Macs.
  2. It's nVidia only and we have more ATI hardware in our Macs than nVidia. (Not our fault - that's what Apple ships!)
Fortunately (and typically for an application that's gone through 8 major revisions) X-Plane already has a lot of these things built right into the app. When working on a long-term code base, the investment in built-in diagnostic code is well worth it...perhaps these will give you some ideas on how to add instrumentation to your application.

All of X-Plane's instrumentation is zero-overhead when not used and relatively low-overhead when used, and it ships in the final application. We do this because we can, and because it allows us to debug installations in the field without having to send out special builds.

Stats Counters
X-Plane uses the plugin dataref system to export a series of private stats counters to a diagnostic plugin for on-screen analysis. The stats counters show everything from the number of cars drawn to the number of segments of the planet view that are rendered.

Stats counters give us a better picture of the internal state of the application. If a user reports a slower framerate, the stats counters can help us tell why. Is it because we're drawing too many cars, or because the planet is being drawn?
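I can't show the sim's real counters, but the shape of such a system is tiny. A hypothetical sketch - the names are invented, and the real sim exports its counters through the plugin dataref system rather than through a map:

```cpp
#include <map>
#include <string>

// Hypothetical stats-counter registry: each subsystem bumps a named counter
// as it draws; a diagnostic overlay reads them back at the end of the frame.
class stat_counters {
public:
	void count(const std::string& name, long amount = 1) { counters_[name] += amount; }
	long get(const std::string& name) const {
		std::map<std::string, long>::const_iterator i = counters_.find(name);
		return i == counters_.end() ? 0 : i->second;
	}
	void reset_frame() { counters_.clear(); }   // call once per frame
private:
	std::map<std::string, long> counters_;
};
```

The important property is that the counting sites live permanently in the rendering code, so the numbers are available in any shipping build the moment a user reports a problem.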

Art Tuning
We also use datarefs to export a series of tuning values for our artists. They can adjust the overall look of lights, cars, the propeller, etc. via these variables. This lets them work in real time, tuning the sim and seeing changes immediately. Once they reach values they like, we set them as the defaults in the sim.

Perf Flags
OpenGL is a pipeline - if any stage of that pipeline slows down, your framerate sinks. So in order to figure out why X-Plane is slow, we need to know which part of the pipeline is overloaded. To that end we have a series of performance flags (again datarefs) that can be set to intentionally change loading of the pipeline. This is an idea inspired by NVPerfHUD, but implemented directly in our engine.
  • One flag will turn off the flight model, lowering CPU load.
  • One flag will change the clip volume, limiting the amount of vertex processing (and all that follows).
  • Another flag will replace all textures with a 2x2 proxy, relieving pressure on AGP bandwidth and on-card VRAM bandwidth.
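The flags themselves can be as dumb as a struct of bools consulted once per stage. A toy sketch (the stage costs are made up; the point is that flipping one flag unloads exactly one pipeline stage, so the flag that moves the framerate names the bottleneck):

```cpp
// Hypothetical perf-flag block: each flag unloads one pipeline stage.
struct perf_flags {
	bool skip_flight_model;    // lowers CPU load
	bool shrink_clip_volume;   // lowers vertex processing (and all that follows)
	bool proxy_textures;       // lowers texture bandwidth
	perf_flags() : skip_flight_model(false), shrink_clip_volume(false), proxy_textures(false) {}
};

// Stand-in for a frame loop where each stage consults its flag; the costs
// here are invented numbers just to make the mechanism testable.
int frame_cost(const perf_flags& f)
{
	int cost = 0;
	if (!f.skip_flight_model)  cost += 10;
	if (!f.shrink_clip_volume) cost += 20;
	if (!f.proxy_textures)     cost += 30;
	return cost;
}
```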

FPS Test
X-Plane ships with a command-line based framerate test. The framerate test controls all sim settings and automatically logs framerate. The framerate test gives us an easy way to regression-test new code and make sure we haven't hurt performance. It also gives us a definitive way to assess the performance of machines in the field.

Hidden Commands
X-Plane exports some hidden commands via the plugin system. (You must have our internal plugin to use them right now.) For example, all pixel shaders can be reloaded from disk without rebooting the sim, which speeds up the development cycle a lot. This kind of functionality is built right into our engine - our shader object understands reloading.
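A shader object that "understands reloading" might look something like this - a hypothetical sketch that only tracks the source text; the real object would also recompile and relink the GL program:

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical hot-reloadable shader: because the object remembers its own
// file path, a single "reload all shaders" command can walk every shader
// and refresh it from disk without rebooting the sim.
class shader {
public:
	explicit shader(const std::string& path) : path_(path) { reload(); }
	bool reload() {
		std::ifstream f(path_.c_str());
		if (!f) return false;          // keep the old source if the file vanished
		std::stringstream ss;
		ss << f.rdbuf();
		source_ = ss.str();            // the real object would recompile here
		return true;
	}
	const std::string& source() const { return source_; }
private:
	std::string path_, source_;
};
```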

Compiler Flags
A few more invasive debugging techniques require #defines to be flipped inside the sim. This includes a lot of logging options (all that file output kills framerate, so we don't even mess with leaving this on or piping it into the bit bucket) which let us really see what's going on all the way down the scene graph. We can also turn on things like wire frames and stepped drawing (drawing the frame one part at a time and swapping the buffer to see the results).

Adaptive Sampling Profiler
The last tool we have is remote scripting of Shark, Apple's adaptive sampling profiler, via a plugin. I can't say enough good things about Shark - it's just a really great tool. Via the plugin system we can script Shark profiling, giving us very accurate profiling of specific blocks. This stuff is normally off and has to be #defined on, since it's a bit invasive (e.g. when we have Shark attached we don't want to profile every single part of the app, because we'll spend all our time waiting for Shark to process the captured samples).

If there's a moral to the story, I suppose it's that it only takes a few more minutes to change a hacked up, temporary, one-off debugging facility into a permanent, reusable, scalable, clean debugging facility, but you get payback every time you work on the codebase. And the payoff for writing code that's designed for analysis and debugging from day one (especially for OpenGL, where so much of the subsystem is opaque, and bugs usually manifest as a black screen) is even greater.

Tuesday, December 12, 2006

Hemophiliac Code

I managed to slice myself pretty thoroughly while trying to make bagel chips tonight. Besides being surprised both at how deep the cut was and at how stupid I am, I had another thought as I type, thumb in a band-aid but otherwise working normally: my thumb's self-repair system works really, really well.

Compare that to a piece of code. You're running along in a happy function and you hit a null object pointer. But you're really supposed to call that method, unconditionally. What to do? Call it and we bus error. Don't call it and, well, we've defied the logic of the program!

The advantage my thumb has over my code is that it knows pretty much what the right thing to do is under certain (predictable) problem conditions. Blood exposed to open air...probably we've been cut - let's clot. (This is similar to a pilot experiencing an engine failure. It's not good, but it's not unexpected, so it's possible to respond in a way that will maximize the chance for success.)

Given that there is a whole category of code defects that we can detect but cannot hope to repair, most programmers take the opposite approach: if we can't hope to survive damage, let's make sure we die every single time! The logic is, better to know that we're getting injured, even if the symptom is the program dying in the lab, than to have unknown damage under the surface that will cause death in the field.

Perhaps a reasonable approach would be, "die early, die often". We never want to have an internal defect and not report it, and we want to report it as early as possible, as that's when we can do the best job of reporting it. Early detection is a good thing in debugging.
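A "die early, die often" policy can be as cheap as an assertion macro used at the earliest point where an invariant is checkable. A sketch (in a shipping build the macro would compile away to nothing; the names here are mine, not X-Plane's):

```cpp
#include <cstdio>
#include <cstdlib>

// Check invariants at the earliest point where a violation is detectable,
// and stop dead rather than limp on with unknown internal damage.
#define SIM_ASSERT(cond) \
	do { if (!(cond)) { \
		std::fprintf(stderr, "ASSERT failed: %s (%s:%d)\n", #cond, __FILE__, __LINE__); \
		std::abort(); } } while (0)

struct obj { int alt; void fly() { ++alt; } };

void update(obj * o)
{
	SIM_ASSERT(o != 0);   // report the null right here, with a useful call
	o->fly();             // stack, not later as a mystery bus error
}
```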

Early detection has become even more important in X-Plane as we start to thread our code. To utilize dual-core hardware, we do some of the CPU-intensive work of constructing our 3-d scenery details on a second core. The main thread farms this work out to a worker thread, which then tosses the finished product back to the main thread to insert into the scene graph between frames.

The problem is: if something goes wrong during scene-graph insertion, we really don't have any idea why. We don't know who called us, because we've just got a chunk of finished geometry (and they all look the same) and the actual code that did the work exited long ago, leaving no call-stack.

Early detection is thus a huge benefit. If we can get our failure on the worker thread as the instantiation happens (rather than later as we edit the scene graph) then we can break into the debugger and play in a wonderland of symbols, local variables, and data.

(Final off-topic thought: why is this code bad? Hint: it's not the algorithm that's bad.)

inline float sqr(float x) { return x*x; }
inline float pythag(float x, float y, float z) {
	return sqrt(sqr(x)+sqr(y)+sqr(z)); }
float angle_between(float vec1[3], float vec2[3])
{
	float l1=pythag(vec1[0],vec1[1],vec1[2]);
	float l2=pythag(vec2[0],vec2[1],vec2[2]);
	if(l1 != 0.0) l1 = 1.0 / l1;
	if(l2 != 0.0) l2 = 1.0 / l2;
	float v1[3] = { vec1[0] * l1, vec1[1] * l1, vec1[2] * l1 };
	float v2[3] = { vec2[0] * l2, vec2[1] * l2, vec2[2] * l2 };
	float dot = v1[0]*v2[0]+v1[1]*v2[1]+v1[2]*v2[2];
	return acos(dot) * 180.0 / PI;
}

Monday, December 11, 2006

Intrinsic Linked Lists for Static Construction

This is another excuse to make sure the blogger move to beta hasn't killed all my blogs. In the past I ranted about not being able to move to blogger beta (this blog moved, the others did not). Now I am happily united entirely on the beta blogger...web bliss is mine. (It would be nice if the old posts listed the correct authors, but that's what we get for flirting with WordPress.)

A while ago I wrote a lot trying to explain how the hell global static construction works in C++.
The best simple summary I can give you is:
  • It doesn't do what you think.
  • Your code will explode in ways you didn't expect.
I also tried to explain the joys of intrinsic linked lists, that is structs that contain a next pointer. (Don't pass these up for the STL - sometimes the old-school technique works better, especially when the issue isn't O(n) but how good your implementation is. Are you sure your app isn't bottlenecked by memory allocation?)

Like peanut butter and chocolate, these two ideas go well together. That is...static construction problems can be fixed by using intrinsic linked lists. This code is guaranteed to make your life hell:

class foo {
	static set<foo *> all_of_me;
	foo() { all_of_me.insert(this); }
	~foo() { all_of_me.erase(this); }
	// more stuff
};

The idea is that the foo class self-tracks all its members...this is all good until you do this in some other CPP file besides the CPP where all_of_me is defined.

static foo this_static_var_will_kill_me;

First the solution, then why it works. The solution is simply this:
class foo {
	static foo * first_of_me;
	foo * next;
	foo() { this->next = first_of_me; first_of_me = this; }
	// more stuff
};

Now foo uses an intrinsic linked list instead of an STL set and we can declare static global foo objects all the time!

Analysis:
  • The problem with static construction is that C++ doesn't guarantee that static global objects will be initialized in any particular order.
  • When a class has a static member variable that is in turn a complex class, that static member variable is effectively a static global object, and thus it will be constructed in no particular order.*
  • In the case of our "foo" example, if the particular foo object is constructed before the set "all_of_me" is constructed, then foo's constructor will try to stick "this" into a set whose contents are probably entirely zero.
  • If that doesn't crash then when the set is constructed (after the foo object) the set will be fully reinitialized, causing our object to be lost.
(For this reason we hope for the crash - unfortunately some STL containers like vector will often fail quietly when they are used before initialization, as long as they're zero'd out first.)

The beauty of intrinsic lists is: C++ guarantees it will do all of the static initialization (that is, zero-setting, etc.) before any functions or methods are called. So we can be sure our list head pointer is NULL, and that's all we need to get started.
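With the head pointer guaranteed to be zero before any constructor runs, walking the registry is a two-line loop. A fuller sketch, including the out-of-line definition the static member still needs:

```cpp
class foo {
public:
	foo() { next_ = first_of_me; first_of_me = this; }
	static int count() {                 // walk the intrinsic list
		int n = 0;
		for (foo * f = first_of_me; f; f = f->next_) ++n;
		return n;
	}
private:
	static foo * first_of_me;            // zero before any ctor runs, guaranteed
	foo *        next_;
};

// The definition still lives in one .cpp file, but since it's a plain
// pointer it gets static (zero) initialization - no constructor, no
// ordering problem, regardless of which translation unit runs first.
foo * foo::first_of_me = 0;

// These can live in any translation unit, in any link order:
static foo a, b, c;
```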

One final note - as I define categories for this post I see Chris has categories for both "rants" and "IIS". I can't imagine that you'd ever want IIS without the rant tag.

* Okay, so there are some rules on construction order. Trust me, they won't help you do anything productive!

Tuesday, November 28, 2006

Returning an Array in C - the Least of 3 Evils

In C++ you can write things like this:
void get_some_strings(vector<string>& out_strings)
{
	out_strings.clear();
	for (some loop)
		out_strings.push_back(whatever);
}
Easy! How do we do this if our API has to be pure C? Well, there are three options.

First you could always allow the function to allocate memory that's deallocated by the caller. I don't like this option at all. To me, it's a walking memory-leak waiting to happen. You'll have to check for memory leaks every time you use the function and there are no compile-time mechanisms to encourage good programming technique.

A second option is to pass in some preallocated buffers. Hrm - I think I tried that once. Unfortunately as you can read here, the technique is very error prone and it took Sandy (the better half of the SDK) to clean up the mess I made.

The third option, and least evil IMO, is to use a callback. Now we have something like this:
void get_some_strings(void (* cb)(const char * str, void * ref), void * ref)
{
	for (some stuff)
		cb(some string, ref);
}
If you can tolerate the C function callbacks (have a few Martinis - they'll look less strange) this actually provides a very simple implementation, which means less risk of memory-management bugs.

On the client side if you're using this library from C++ you can even turn this back into the vector code you wanted originally like this:
void vector_cb(const char * str, void * ref)
{
	vector<string> * v = (vector<string> *) ref;
	v->push_back(str);
}
Now I can use this one callback everywhere like this:
vector<string> my_results;
get_some_strings(vector_cb, &my_results);
What I like about the callback solution is that both the client code and the implementation can be written in a simple and concise manner, which translates into solid code.
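Putting the pieces together, a complete compilable sketch of the pattern (the two sample strings are mine, standing in for whatever the real loop produces):

```cpp
#include <string>
#include <vector>

// The C-style API: results are pushed to the caller through a callback,
// so neither side allocates memory on the other's behalf.
void get_some_strings(void (* cb)(const char * str, void * ref), void * ref)
{
	const char * results[] = { "alpha", "beta" };   // stand-in for the real loop
	for (int i = 0; i < 2; ++i)
		cb(results[i], ref);
}

// The C++ client adapter: one callback turns any such API back into a vector.
void vector_cb(const char * str, void * ref)
{
	static_cast<std::vector<std::string> *>(ref)->push_back(str);
}

std::vector<std::string> fetch()
{
	std::vector<std::string> my_results;
	get_some_strings(vector_cb, &my_results);
	return my_results;
}
```

(In a real C header the API and the callback typedef would be wrapped in extern "C"; it's omitted here to keep the sketch to one file.)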

Tuesday, November 07, 2006

Surprise - pthread_cond_timedwait takes an absolute time!

Whoever reads the man pages anyway? Turns out pthread_cond_timedwait takes an absolute time! In other words, if you want it to sleep for one second, you have to pass one second more than the current time, as returned by gettimeofday (yuck) and converted from a timeval to a timespec (double yuck).

As much as I gripe about this because most threading APIs take relative timeouts (as does select), there actually is a use for this.

When writing a thread-safe message queue you might write something like this to read the queue:
lock critical section
increment waiting thread count
while (we don't have a message)
	if (pthread_cond_timedwait(condition var, deadline) == ETIMEDOUT)
		decrement waiting thread count, unlock, return "timed out"
decrement waiting thread count
read message out of queue
unlock critical section
Now...you might wonder, why do we need a while loop? The answer is: it's possible that between the time our thread is woken up (via the condition variable wait) due to a message being queued and the time we actually run, get the lock, and continue execution, another thread could have come along and stolen our message. (Note that another thread can go through this code without ever calling pthread_cond_timedwait if there is already a message, which is good for performance. This is not a FIFO message queue!) Thus we have to loop around until we reach a point where we've woken up, acquired the lock, and the message is still there. Once we have the lock, the message isn't going anywhere and we can safely exit the loop.

This is where the absolute time is handy - we might go around the loop 6 times. But the "deadline" - the time after which there's no point in waiting, we should return to the user, is an absolute time, and can be invariant across the loop. Thus there is no need to measure how much time has gone by on each loop and decrement our wait time.
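The timeval-to-timespec conversion the post gripes about looks something like this (a sketch with error handling omitted; by default pthread_cond_timedwait compares against CLOCK_REALTIME, the same clock gettimeofday reads):

```cpp
#include <sys/time.h>
#include <time.h>

// Build the absolute deadline pthread_cond_timedwait wants: now (from
// gettimeofday, in a timeval) plus a relative timeout, as a timespec.
// Compute this ONCE before the while loop; the same deadline is valid on
// every trip around, no matter how many spurious wakeups we eat.
timespec make_deadline(long timeout_ms)
{
	timeval now;
	gettimeofday(&now, 0);                      // yuck: microseconds
	timespec deadline;
	deadline.tv_sec  = now.tv_sec + timeout_ms / 1000;
	deadline.tv_nsec = now.tv_usec * 1000L      // double yuck: nanoseconds
	                 + (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {      // carry the overflow
		deadline.tv_sec  += 1;
		deadline.tv_nsec -= 1000000000L;
	}
	return deadline;
}
```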

Tuesday, October 10, 2006

OpenGL Fogging Artifacts

Here's a video of a very very low visibility approach in X-Plane. Note how the "fog" (that is, the mixing of the runway and ground to gray) pulses in and out as we fly, and they don't do it at the same time. What's going on here?



What you're seeing is a defect in the fixed-function pipeline. The problem is two-fold:
  1. OpenGL implementations are allowed to calculate fog colors at vertices and do a simple interpolation between the vertices.
  2. The vertices that we interpolate between are not necessarily the corners of your triangle; they could be the vertices that OpenGL adds when it clips your triangle to the view frustum.
So we have two sets of artifacts at once. First consider the case of the ground and runways. Since the fogging "interval" (the distance between fog = 0% and fog = 100%) is quite small here, the same amount of fog is spread along the entirety of a runway triangle (about 50 meters deep) and a ground mesh triangle (at least 90 meters deep, but possibly up to 1 km deep). That means that we go from visible to fogged much faster over the runway than over the ground.

As we fly, the actual size of the mesh triangles is changing, as part of each mesh triangle scrolls off screen. This in turn affects the gradient of how fast we fog and what the corner fog colors are.

The results are, well, that video: the fog doesn't match between the runways and the ground, and the particular strange results vary as we fly.

The solution is, like all things in life, to replace the fixed-function pipeline with a pixel shader. The pixel shader can then use a per-fragment value (like the depth value) to fog. This is more expensive (well, probably not really...we have the depth value around and it's the same number of DSP ops) but will produce consistent fog across the entire area.
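You can see the artifact with nothing but arithmetic. A sketch comparing the two fogging strategies along one triangle edge, using a linear fog ramp and made-up distances:

```cpp
// Linear fog ramp: 0% fog at fog_start, 100% at fog_end, clamped.
float fog_at(float dist, float fog_start, float fog_end)
{
	float f = (dist - fog_start) / (fog_end - fog_start);
	return f < 0.0f ? 0.0f : (f > 1.0f ? 1.0f : f);
}

// Per-vertex fog (fixed function): fog the endpoints, then interpolate
// the fog values across the edge (t = 0 at near_d, t = 1 at far_d).
float fog_per_vertex(float near_d, float far_d, float t,
                     float fog_start, float fog_end)
{
	float f0 = fog_at(near_d, fog_start, fog_end);
	float f1 = fog_at(far_d,  fog_start, fog_end);
	return f0 + (f1 - f0) * t;
}

// Per-fragment fog (pixel shader): interpolate the distance, then fog it.
float fog_per_fragment(float near_d, float far_d, float t,
                       float fog_start, float fog_end)
{
	return fog_at(near_d + (far_d - near_d) * t, fog_start, fog_end);
}
```

With a ramp from 100 m to 150 m, the midpoint of an edge running from 50 m to 1050 m sits at 550 m - well past total fog - yet per-vertex interpolation reports it only half fogged, while the per-fragment version correctly returns 100%. Shrink the triangle and the error shrinks with it, which is exactly the runway-versus-ground mismatch in the video.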