Saturday, May 31, 2008

How To Drive CVS Totally Insane

Okay, I do have to admit: posting on why CVS sucks or how to make it slow is the lamest thing ever.  I mean, duh!  Everyone knows that CVS is 1970's technology, 3 parts dope to 1 part disco, and there are better alternatives.

But...I think the performance situation I found is interesting because of when it happens.  (In particular, this performance problem has been with us for months and I only noticed now.)

The airport database is a 40 MB, 800,000-line text file.  We checked it into CVS.

Hey, stop laughing...it seemed like a good idea at the time!

What's interesting is that CVS can (or at least could) check revisions in on the main branch fast enough that we never realized we were getting into trouble, and check out the head revision at full speed, limited only by our network connection.

Poking into the repository, it turns out that this is because the head revision is stored in its entirety, along with diff instructions for stepping backward, one revision at a time, toward the first revision.

This goes a long way toward explaining why CVS doesn't seem to get incrementally slower at fetching code as you use it: in truth it does, but the cost is borne by the older revisions in the repository, not the newer ones!

Where we got into trouble was when I checked data into a branch.  A branch revision cannot be the head revision, so it is stored as a delta off the revision it was branched from.  With a huge file, the cost of applying even one diff on the fly is quite large.
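To make the cost asymmetry concrete, here is a minimal sketch of the storage scheme as I understand it - this is an illustration, not CVS's actual code, and the delta format is simplified to whole-line replacements:

    // Illustration only: the head trunk revision is stored whole; everything
    // else is reconstructed by replaying deltas, and every replay touches the
    // entire (in our case 800,000-line) file.
    #include <cstddef>
    #include <string>
    #include <vector>

    typedef std::vector<std::string> FileText;

    struct Edit  { std::size_t line; std::string new_text; };
    typedef std::vector<Edit> Delta;          // one revision step

    // Checking out the head is just a read.
    FileText checkout_head(const FileText& stored_head)
    {
        return stored_head;
    }

    // Applying one delta copies the whole file - already expensive at 40 MB.
    FileText apply_delta(const FileText& in, const Delta& d)
    {
        FileText out(in);
        for (std::size_t i = 0; i < d.size(); ++i)
            out[d[i].line] = d[i].new_text;
        return out;
    }

    // An older trunk revision: walk backward from the head, one reverse delta
    // per revision. A branch tip: walk back to the branch point, then forward
    // along the branch. Either way we pay O(file size) per step.
    FileText checkout_via_deltas(FileText text, const std::vector<Delta>& steps)
    {
        for (std::size_t i = 0; i < steps.size(); ++i)
            text = apply_delta(text, steps[i]);
        return text;
    }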

In our case, getting the head revision takes less than a minute; getting either the previous revision of this file or the latest version on the branch takes over 45 minutes!

I am not yet sure what we can do about this - the fact that historical check-outs will be slow is annoying, but clearly we don't use that feature very often.  (Fortunately this is a rarely-changing file.)

Sunday, May 25, 2008

Pitfalls of Performance-Tuning OpenGL

Performance-profiling an OpenGL application is more difficult than profiling a non-OpenGL application because:
  1. OpenGL uses a pipelined architecture; the slowest stage limits performance while the other parts of the system go idle, and
  2. You don't always have good insight into what's going on inside the OpenGL pipeline.
(Toward this second point, there are tools like NVidia's PerfHUD and GLExpert that can tell you some of this, but your regular adaptive-sampling profiler won't help you much.)

Pitfall 1: Not Being Opportunistic

The key to opportunistic performance tuning is to look at your whole application - creating a specialized "test case" to isolate performance problems can be misleading. For example, in X-Plane our break-down of main-thread CPU use might be roughly this:
  • Flight model/physics: 10%.
  • 2-d Panel: 5%.
  • 3-d World Rendering: 85%.
This is telling us: the panel doesn't matter much - it's a drop in the bucket. Let's say I make the panel code twice as fast (a huge win). I get a 2.5% performance boost. Not worth much. But if I make the world rendering twice as fast I get a 73% performance boost.
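The arithmetic behind those numbers is just Amdahl's law. Here's a quick sketch (the 5% and 85% figures are the breakdown above; the 2x speedups are hypothetical) - it prints roughly +2.6% for the panel and +74% for the world, in line with the rough figures above:

    #include <cstdio>

    // Overall speedup from making a component that takes `fraction` of the
    // frame run `component_speedup` times faster (Amdahl's law).
    double overall_speedup(double fraction, double component_speedup)
    {
        return 1.0 / ((1.0 - fraction) + fraction / component_speedup);
    }

    int main()
    {
        // Panel: 5% of the frame, made twice as fast.
        std::printf("panel: +%.1f%%\n", (overall_speedup(0.05, 2.0) - 1.0) * 100.0);
        // World rendering: 85% of the frame, made twice as fast.
        std::printf("world: +%.1f%%\n", (overall_speedup(0.85, 2.0) - 1.0) * 100.0);
        return 0;
    }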

The naive mistake is to stub out the physics and world rendering to "drill down" into the panel. What the profiler is saying is: don't even bother with the panel - there are bigger fish to fry.

Pitfall 2: Non-Realistic Usage and the Pipeline

The first pitfall is not OpenGL specific - any app can have that problem. But drill-down gets a lot weirder when we have a pipeline.

The key point here is: the OpenGL pipeline is as slow as the slowest stage - all other stages will go idle and wait. (Your bucket brigade is as slow as the slowest member.)

Therefore when you stub out a section of code to "focus" on another and OpenGL is in the equation, you do more than distort your optimization potential; you distort the actual problem at hand.

For example, imagine that the sum of the panel and scenery code uses up all of the GPU's command buffers (one phase of the pipeline) but pixel fill rate is not used up. When you comment out the scenery code, we use fewer command buffers, the panel runs faster, and all of a sudden the panel bottlenecks on fill rate.

We look at our profiler and go "huh - we're fill rate bound" and try to optimize fill rate in the panel through tricks like stenciling.

When we turn the scenery engine back on, our fill rate use is even lower, but since we are now bottlenecked on command buffers, our fill rate optimization gets us nothing. We were misled because we optimized the wrong bottleneck, and we saw the wrong bottleneck because the bottleneck changed when we removed part of the real workload.
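A back-of-the-envelope model of what just happened (all of the per-stage millisecond numbers here are made up for illustration): frame time is roughly the maximum over the stages, so removing the scenery doesn't just make the frame faster, it changes which stage shows up as the bottleneck.

    #include <algorithm>
    #include <cstdio>

    // The pipeline runs at the speed of its slowest stage.
    double frame_time_ms(double cpu, double command_buffer, double fill)
    {
        return std::max(cpu, std::max(command_buffer, fill));
    }

    int main()
    {
        // Real workload (panel + scenery): command buffers are the bottleneck,
        // so fill-rate savings in the panel buy us nothing.
        std::printf("full scene: %.1f ms\n", frame_time_ms(8.0, 12.0, 6.0));

        // Scenery stubbed out: fill rate now looks like the bottleneck, which
        // is what misleads us into optimizing the wrong thing.
        std::printf("panel only: %.1f ms\n", frame_time_ms(3.0, 4.0, 5.0));
        return 0;
    }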

One non-obvious case of this is frame rate itself; a high frame rate can cause the GPU to bottleneck on memory bandwidth and fill rate, as it spends time simply "flipping" the screen over and over again. It's as if there's an implicit full-screen quad drawn per frame; that's a lot of fill for very little vertex processing - as the ratio of work per frame to number of frames changes, that "hidden quad" starts to matter.

So as a general rule, load up your program and profile graphics in a real-world scenario, not at 300 fps with too little work (or 3 fps with too much work).

Pitfall 3: Revealing New Bottlenecks

There's another way you can get in trouble profiling OpenGL: once you make the application faster, OpenGL load shifts again and new bottlenecks are revealed.

As an example, imagine that we are bottlenecked on the CPU (typical) but pixel fill rate is at 90% capacity, and other GPU resources are relatively idle. We go to improve CPU performance (because it's the bottleneck), but we optimize too far and end up bottlenecked completely on fill rate; we see only a fraction of the benefit of our CPU work.

Now this isn't the worst thing; there will always be "one worst problem" and you can then optimize fill rate. But it's good to know how much benefit you'll get for an optimization; some optimizations are time consuming or have a real cost in code complexity, so knowing in advance that you won't get a big win is useful.

Therefore there is a case of stubbing that I do recommend: stubbing old code to emulate the performance profile of new code. This can be difficult - for example, if you're optimizing CPU code that emits geometry, you can't just stub the code or the geometry (and with it the GPU load) goes away. But when you can find this case, you can get a real measurement of what kind of performance win is possible.
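Here is a sketch of one way to build such a stub (the names are hypothetical): do the expensive work once and replay the stored result, so the GPU keeps receiving the same geometry while the CPU cost drops to roughly what the optimized code would spend.

    #include <vector>

    struct Vertex { float x, y, z; };

    // Placeholder for the slow per-frame CPU path we believe we can make
    // nearly free (for example, by caching).
    std::vector<Vertex> rebuild_geometry()
    {
        std::vector<Vertex> v(100000);
        // ... imagine a lot of per-frame CPU work here ...
        return v;
    }

    // Placeholder for submission to the GL (glBufferData/glDrawArrays in real life).
    void draw_geometry(const std::vector<Vertex>& v)
    {
        (void)v;
    }

    void draw_frame_stubbed()
    {
        // Pay the rebuild cost once; every later frame sends identical geometry,
        // which is what the (not yet written) cached version would do. Comparing
        // frame rate against the un-stubbed build bounds the possible win.
        static std::vector<Vertex> cached = rebuild_geometry();
        draw_geometry(cached);
    }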

Friday, May 16, 2008

Musings on Weird UI Designs

I've written too many UI frameworks (where by "too many" I mean more than zero). You could shout at me that I should use a library, but the calculus that has led me to do this over and over is pretty inescapable:
  • An existing old huge code base uses an existing, old UI framework.
  • The whole UI framework is perhaps 100 kloc and will probably be about the same when renovations are done.
  • The application itself is 500 kloc, or maybe 3 mloc...applications get big.
  • The interface between the UI framework and application is "wide" because all of the app UI behavior code has to talk to the UI framework.
So given the choice of recoding 100 kloc of UI or touching a huge chunk of a 500 kloc app, what do you do? You go rework the UI framework so that you can preserve the existing UI.

That decision has been true in X-Plane twice now (and was true in previous, more conventional companies too). But with X-Plane I hit a particular, strange situation: no objects!

In a traditional user interface, some kind of persistent data structure represents the on-screen constructs. Typically this data structure is an object hierarchy, so that message passing between objects can be used to resolve how user input is handled, and polymorphic object behavior specializes the actions of each "widget" in the user interface.

X-Plane's normal user interface code, which derives from a more game-like approach, is quite a different beast.
  • Every user interface widget is implemented by a "function", called every frame, that both draws the widget and processes any user input events (that are essentially stored globally, e.g. "is the mouse down now", "what is the currently pressed key").
  • A dialog box is simply a series of function calls.
  • The function calls take as arguments layout information and what to do with the data, e.g. what floating point variable is "edited" by this UI element.
This creates very terse code, but a difficult problem when it comes to adding "state". There are no objects to add member variables to!
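A minimal sketch of the style (the names are hypothetical - X-Plane's real widget functions take more arguments than this): the "widget" is just a function that draws itself and edits the variable it is handed, using globally stored input state.

    // Hypothetical illustration of a per-frame "widget function": no widget
    // objects anywhere, just calls that draw and edit data in place.
    struct MouseState { float x, y; bool down; };
    MouseState g_mouse;                               // polled input, stored globally

    void draw_slider(float x, float y, float width,
                     float min_val, float max_val,
                     float* value)                    // the data this widget "edits"
    {
        // ... draw the slider track and thumb at (x, y) ...
        if (g_mouse.down &&
            g_mouse.x >= x && g_mouse.x <= x + width &&
            g_mouse.y >= y - 10.0f && g_mouse.y <= y + 10.0f)
        {
            float t = (g_mouse.x - x) / width;        // input handling lives here too
            *value = min_val + t * (max_val - min_val);
        }
    }

    // A "dialog box" is then just a series of such calls, made every frame.
    void draw_engine_dialog(float* throttle, float* mixture)
    {
        draw_slider(20, 100, 200, 0.0f, 1.0f, throttle);
        draw_slider(20, 140, 200, 0.0f, 1.0f, mixture);
    }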

Working with such a weird construct has certainly made me "think different" about UI...it turns out that every feature I've wanted to add to X-Plane has been pretty straightforward, once I purged myself of the notion of objects.

State Without State

For X-Plane 9 I did some work on keyboard handling, providing real text editing in text widgets, keyboard focus, etc. The solution to implementing these features without state is:
  • A UI "widget function" (that is, the code drawing a particular instance of a widget) can calculate a unique ID for the widget on the fly based on the input data. Thus the function can tell "which" widget we are at any one time.
  • It turns out that usually the only state needed is state for the widget with focus, plus which widget has focus. That is...all the variables turn out to be class-wide. (Remember, the data storage for the values of the widgets is already being passed into the functions.)
  • Some global infrastructure is needed to maintain the notion of focus, but that's true of any system, whether it's global or on the root window. (Since X-Plane has only one root window, the two are basically the same.)
I went into the project thinking that I was going to have to visit every single call site of every single UI function...after working that way for about a day I realized that the scope of the problem was much bigger than I thought, went back, and did the above "dynamic identity" scheme instead.
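Here is a rough sketch of the "dynamic identity" idea (hypothetical code - the helpers for mouse and keyboard state are assumed, not shown): the widget function derives an ID from the data it edits, and a single set of globals tracks which ID has focus and the text being edited.

    #include <cstdint>
    #include <string>

    typedef std::uintptr_t WidgetID;

    static WidgetID    g_focus_id = 0;        // which widget has focus (0 = none)
    static std::string g_edit_buffer;         // text being edited in that widget

    // Identity is computed on the fly - here, from the address of the variable
    // the widget edits, which is unique per widget instance.
    inline WidgetID widget_id(const void* edited_data)
    {
        return reinterpret_cast<WidgetID>(edited_data);
    }

    bool mouse_clicked_in(float x, float y, float width);        // assumed helper
    void apply_pending_keystrokes(std::string* buffer);          // assumed helper

    void draw_text_field(float x, float y, float width, std::string* value)
    {
        WidgetID me = widget_id(value);

        if (mouse_clicked_in(x, y, width))
        {
            g_focus_id    = me;               // take focus
            g_edit_buffer = *value;           // start editing a copy
        }
        if (g_focus_id == me)
        {
            apply_pending_keystrokes(&g_edit_buffer);
            *value = g_edit_buffer;           // commit back to the caller's data
        }
        // ... draw the field, highlighted if g_focus_id == me ...
    }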

This is sort of like the "flyweight" technique suggested in the Gang of Four book, but the motivation is different. GoF suggest flyweighting for performance (e.g. 2 million objects are too slow). To me this is ludicrous...while I do think you need to optimize performance based on real profiling data, data organization for performance is fundamental to design. If you need to flyweight your design to fix a performance problem, you screwed your design up pretty badly.

This is different - it's flyweighting out of desperation - that is, to avoid rewriting a huge chunk of code.

Should that code later be refactored? I'm not sure. (The above design does make it possible though - you just make a new API on top of the old API that looks more like a conventional OOP system; once you've entirely migrated to the new API you can reimplement your widgets in a more conventional manner.) The conventional wisdom would be: yes, refactor, but in our case, I just don't see the win. I'm a strong advocate of continuous refactoring to improve future productivity. X-Plane's strange UI design scheme has not been a problem; Austin is exceedingly fast at writing new UI code and I wouldn't want to get in the way of that!

Wedge This In

WorldEditor and the scenery code come with a UI framework that's as close to my current thinking on how UI should be done as I'm ever going to get. Some of the quirks: all coordinates are window-relative (I believe that widget-relative coordinates don't save client code and do make debugging harder), and coarse refreshes (on today's hardware it's not worth spending a lot of brain power trying to figure out what needs to be redrawn, especially when your artist is going to give you translucent elements that require the background to be redrawn anyway).

I am reworking the PlaneMaker panel editor right now; the code had reached a point of complexity where it needed to be refactored for future work, so I built a cheap & dirty version of the WED UI hierarchy to live inside X-Plane.

I tried something even stranger than the WED UI framework: the widget code in PlaneMaker doesn't persist the location of objects in any way! All of the hierarchy calls pass the location into the widget when it is called.

The initial coding of this was a little bit weird, but it actually has a strange silver lining: no resize-refresh problems!

When I wrote WED, I had to deal with a whole category of bugs where something has changed in the data model, and the right set of messages wasn't set up to tickle the UI to resize itself and redo the layout. With the PlaneMaker code this is never an issue because the data model is examined any time we do anything - our location decisions are always current.
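Roughly, the pattern looks like this (hypothetical names, with drawing elided): nothing ever caches a rectangle; each call recomputes its children's locations from the data model and passes them down.

    #include <cstddef>
    #include <vector>

    struct Rect       { float left, bottom, right, top; };
    struct Instrument { float panel_x, panel_y, width, height; };
    struct PanelModel { std::vector<Instrument> instruments; };   // the data model

    Rect rect_for_instrument(const Instrument& i)
    {
        Rect r = { i.panel_x, i.panel_y, i.panel_x + i.width, i.panel_y + i.height };
        return r;
    }

    void draw_instrument(const Instrument& i, const Rect& where);  // assumed, not shown

    void draw_panel_editor(const PanelModel& model)
    {
        for (std::size_t n = 0; n < model.instruments.size(); ++n)
        {
            // Location is derived from the model on the spot - no cached layout,
            // so there is no "model changed but the UI never heard" bug to have.
            draw_instrument(model.instruments[n],
                            rect_for_instrument(model.instruments[n]));
        }
    }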

The down-side of this is of course performance; a lot of code is getting called a lot more often than needed. I figure if this turns out to be a problem, I'll add caching on an as-needed basis and add notifications to match. One of the reasons this may be a non-issue is that PlaneMaker's panel editor edits a very small set of data; a single panel can have at most 400 instruments. WED can open 20,000 airports at once (um, don't try this at home), so the caching code had to be very carefully written to not have slow paths in hot loops.

Thursday, May 01, 2008

What Are My Threads Really Doing

I've blogged about how much I love Shark about four million times now.  But here's another cool Shark trick.

In the past I've focused on time profiles, which tell you where CPU and thread time are being spent, letting you understand (1) why your app isn't fast and (2) why a thread isn't getting things done in a timely manner.

You can also do a system call trace.  What's cool about a system call trace is that it picks up both the kernel traps necessary to talk to the OpenGL driver and all of the thread-synchronization primitives.

This is a timeline view of a system trace of X-Plane building a forest while you fly.  The top yellow bar is the main thread; the green icons are IOKit calls from the GL user-space driver into the kernel - each one is a single batch being dispatched.  Shark gives you the synchronous time spent in each call (that is, how long it took the GL to queue the command and return), and if any call causes a block, we can see it.

Below it, the yellow and purple bar is a forest extrusion; the purple sections are VM-based zero-fill calls; the green icon at the end is a call to the GL to buffer up a VBO - it turns out this causes the zero fill. At the end of the bar we see a blue icon when the worker thread waits on the message queue for the next bit of work to process.

Apple provides another tool to view thread time: Thread Viewer.

On the right is a Thread Viewer view of X-Plane on an 8-core Mac Pro.  The bottom solid green line is the main thread, running all the time.  The cascading green dots are asynchronous scenery generation (in this case forests are on maximum and the sim is running in a debug mode, which consumes more CPU).  As far as I can tell, they cascade because X-Plane has built 8 worker threads, and each one gets the next work task in turn.  Up top are threads built by the sound manager to feed the audio hardware.