Saturday, May 31, 2008

How To Drive CVS Totally Insane

Okay, I do have to admit: posting on why CVS sucks or how to make it slow is the lamest thing ever.  I mean, duh!  Everyone knows that CVS is 1970's technology, 3 parts dope to 1 part disco, and there are better alternatives.

But...I think the performance situation I found is interesting because of when it happens.  (In particular, this performance problem has been with us for months and I only noticed now.)

The airport database is a 40 MB, 800,000 line text file.  We checked it into CVS...

Hey, stop laughing...it seemed like a good idea at the time!

What's interesting is that CVS can (or at least could) check revisions into the main branch fast enough that we didn't realize we were getting into trouble, and check out the head revision at linear speed, limited only by the network connection.

Poking into the repository, it turns out that this is because the head revision is stored in its entirety, with diff instructions to move back one step at a time toward the first revision.

This goes a long way toward explaining why CVS doesn't seem to become incrementally slower at getting code as you use it: in truth it does, but the cost is borne by the older revisions in the repository, not the newer ones!

Where we got into trouble was when I checked data into a branch.  The branch cannot be the head revision, so it is stored as a delta off the version it came from.  With a huge file, the cost of applying even one diff on the fly is quite large.

In our case, getting the head revision takes less than a minute; getting either the previous revision of this file or the latest version on the branch takes over 45 minutes!

I am not yet sure what we can do about this - the fact that historical check-outs will be slow is annoying, but clearly we don't use that feature very often.  (Fortunately this is a rarely-changing file.)

Sunday, May 25, 2008

Pitfalls of Performance-Tuning OpenGL

Performance-profiling an OpenGL application is more difficult than profiling a non-OpenGL application because:
  1. OpenGL uses a pipelined architecture; the slowest component will limit performance while other parts of the system go idle, and
  2. You don't always have good insight into what's going on inside the OpenGL pipeline.
(Toward this second point, there are tools like NVidia's PerfHUD and GLExpert that can give you that insight, but your regular adaptive-sampling profiler won't help you much.)

Pitfall 1: Not Being Opportunistic

The key to opportunistic performance tuning is to look at your whole application - creating a specialized "test case" to isolate performance problems can be misleading. For example, in X-Plane our break-down of main-thread CPU use might be roughly this:
  • Flight model/physics: 10%.
  • 2-d Panel: 5%.
  • 3-d World Rendering: 85%.
This is telling us: the panel doesn't matter much - it's a drop in the bucket. Let's say I make the panel code twice as fast (a huge win). I get about a 2.5% performance boost. Not worth much. But if I make the world rendering twice as fast I get roughly a 74% performance boost.

The naive mistake is to stub out the physics and world rendering to "drill down" into panel. What the profiler is saying is: don't even bother with the panel, there are bigger fish to fry.

Pitfall 2: Non-Realistic Usage and the Pipeline

The first pitfall is not OpenGL specific - any app can have that problem. But drill-down gets a lot weirder when we have a pipeline.

The key point here is: the OpenGL pipeline is as slow as the slowest stage - all other stages will go idle and wait. (Your bucket brigade is as slow as the slowest member.)

Therefore when you stub out a section of code to "focus" on another and OpenGL is in the equation, you do more than distort your optimization potential; you distort the actual problem at hand.

For example, imagine that the panel and scenery code together use up all of the GPU's command buffers (one stage of the pipeline), but pixel fill rate is not used up. When you comment out the scenery code, we use fewer command buffers, the panel runs faster, and all of a sudden the panel bottlenecks on fill rate.

We look at our profiler and go "huh - we're fill rate bound" and try to optimize fill rate in the panel through tricks like stenciling.

When we turn the scenery engine back on, our fill rate use is even lower, but since we are now bottlenecked on command buffers, our fill-rate optimization gets us nothing. We were misled because we optimized the wrong bottleneck, and we saw the wrong bottleneck because the bottleneck changed when we removed part of the real workload.

One non-obvious case of this is framerate itself; a high frame rate can cause the GPU to bottleneck on memory bandwidth and fill rate, as it spends time simply "flipping" the screen over and over again. It's as if there's an implicit full-screen quad drawn per frame; that's a lot of fill for very little vertex processing - as the ratio of work per frame to number of frames changes, that "hidden quad" starts to matter.

So as a general rule, load up your program and profile graphics in a real-world scenario, not at 300 fps with too little work (or 3 fps with too much work).

Pitfall 3: Revealing New Bottlenecks

There's another way you can get in trouble profiling OpenGL: once you make the application faster, OpenGL load shifts again and new bottlenecks are revealed.

As an example, imagine that we are bottlenecked on the CPU (typical), but pixel fill rate is at 90% of capacity while other GPU resources are relatively idle. We go to improve CPU performance (because it's the bottleneck), but we optimize too far and end up bottlenecked completely on fill rate; we get to see only a fraction of our CPU win.

Now this isn't the worst thing; there will always be "one worst problem" and you can then optimize fill rate. But it's good to know how much benefit you'll get for an optimization; some optimizations are time consuming or have a real cost in code complexity, so knowing in advance that you won't get a big win is useful.

Therefore there is a case of stubbing that I do recommend: stubbing old code to emulate the performance profile of new code. This can be difficult -- for example, if you're optimizing CPU code that emits geometry, you can't just stub the code out or the geometry goes away. But when you can find this case, you can get a real measurement of what kind of performance win is possible.

Friday, May 16, 2008

Musings on Weird UI Designs

I've written too many UI frameworks (where by "too many" I mean more than zero). You could shout at me that I should use a library, but the calculus that has led me to do this over and over is pretty inescapable:
  • An existing old huge code base uses an existing, old UI framework.
  • The whole UI framework is perhaps 100 kloc and will probably be about the same when renovations are done.
  • The application itself is 500 kloc, or maybe 3 mloc...applications get big.
  • The interface between the UI framework and application is "wide" because all of the app UI behavior code has to talk to the UI framework.
So given the choice of recoding 100 kloc of UI or touching a huge chunk of a 500 kloc app, what do you do? You go rework the UI framework so that you can preserve the existing UI.

That decision has been true in X-Plane twice now (and was true in previous, more conventional companies too). But with X-Plane I hit a particular, strange situation: no objects!

In a traditional user interface some kind of persistent data structure represents the on-screen constructs; typically this data structure might be an object hierarchy, so that message passing between objects can be used to resolve how user input is handled; polymorphic object behavior specializes the actions of each "widget" in the user interface.

X-Plane's normal user interface code, which derives from a more game-like approach, is quite a different beast.
  • Every user interface widget is implemented by a "function", called every frame, that both draws the widget and processes any user input events (that are essentially stored globally, e.g. "is the mouse down now", "what is the currently pressed key").
  • A dialog box is simply a series of function calls.
  • The function calls take as arguments layout information and what to do with the data, e.g. what floating point variable is "edited" by this UI element.
This creates very terse code, but a difficult problem when it comes to adding "state". There are no objects to add member variables to!

Working with such a weird construct has certainly made me "think different" about UI...it turns out that every feature I've wanted to add to X-Plane has been pretty straightforward, once I purged myself of the notion of objects.

State Without State

For X-Plane 9 I did some work on keyboard handling, providing real text editing in text widgets, keyboard focus, etc. The solution to implementing these features without state is:
  • A UI "widget function" (that is, the code drawing a particular instance of a widget) can calculate a unique ID for the widget on the fly based on the input data. Thus the function can tell "which" widget we are at any one time.
  • It turns out that usually the only state needed is state for the widget with focus, plus which widget has focus. That is...all the variables turn out to be class-wide. (Remember, the data storage for the values of the widgets is already being passed into the functions.)
  • Some global infrastructure is needed to maintain the notion of focus, but that's true of any system, whether it's global or on the root window. (Since X-Plane has only one root window, the two are basically the same.)
I went into the project thinking that I was going to have to visit every single call site of every single UI function...after working at that for about a day I realized that the scope of the problem was much bigger than I had thought, went back, and did the above "dynamic identity" scheme instead.

This is sort of like the "fly-weighting" technique suggested in the Gang of Four book, but the motivation is different. The GoF suggest fly-weighting for performance (e.g. 2 million objects are too slow). To me this is ludicrous...while I do think you should optimize performance based on real profiling data, organizing data for performance is fundamental to design. If you need to fly-weight your design to fix a performance problem, you screwed up your design pretty badly.

This is different - it's fly-weighting out of desperation - that is, to avoid rewriting a huge chunk of code.

Should that code later be refactored? I'm not sure. (The above design does make it possible though - you just make a new API on top of the old API that looks more like a conventional OOP system; once you've entirely migrated to the new API you can reimplement your widgets in a more conventional manner.) The conventional wisdom would be: yes, refactor, but in our case, I just don't see the win. I'm a strong advocate of continuous refactoring to improve future productivity. X-Plane's strange UI design scheme has not been a problem; Austin is exceedingly fast at writing new UI code and I wouldn't want to get in the way of that!

Wedge This In

WorldEditor and the scenery code come with a UI framework that's as close to my current thinking on how UI should be done as I'm ever going to get...some of the quirks include: all coordinates are window-relative (I believe that widget-relative coordinates don't save client code and make debugging harder) and coarse refreshes (on today's hardware it's not worth spending a lot of brain power trying to figure out what needs to be redrawn, especially when your artist is going to give you translucent elements that require the background to be redrawn anyway).

I am reworking the PlaneMaker panel editor right now; the code had reached a point of complexity where it needed to be refactored for future work, so I built a cheap & dirty version of the WED UI hierarchy to live inside X-Plane.

I tried something even stranger than the WED UI framework: the widget code in PlaneMaker doesn't persist the location of objects in any way! All of the hierarchy calls pass the location into the widget when it is called.

The initial coding of this was a little bit weird, but it actually has a strange silver lining: no resize-refresh problems!

When I wrote WED, I had to deal with a whole category of bugs where something has changed in the data model, and the right set of messages wasn't set up to tickle the UI to resize itself and redo the layout. With the PlaneMaker code this is never an issue because the data model is examined any time we do anything - our location decisions are always current.

The down-side of this is of course performance; a lot of code is getting called a lot more often than needed. I figure if this turns out to be a problem, I'll add caching on an as-needed basis and add notifications to match. One of the reasons this may be a non-issue is that PlaneMaker's panel editor edits a very small set of data; a single panel can have at most 400 instruments. WED can open 20,000 airports at once (um, don't try this at home), so the caching code had to be very carefully written to not have slow paths in hot loops.

Thursday, May 01, 2008

What Are My Threads Really Doing

I've blogged about how much I love Shark about four million times now.  But here's another cool Shark trick.

In the past I've focused on time profiles, which tell you where CPU and thread time are being spent, letting you understand (1) why your app isn't fast and (2) why a thread isn't getting things done in a timely manner.

You can also do a system call trace.  What's cool about a system call trace is that it picks up both the kernel traps necessary to talk to the OpenGL driver and all of the thread-synchronization primitives.

This is a timeline view of a system trace of X-Plane building a forest while you fly.  The top yellow bar is the main thread; the green icons are IOKit calls from the GL user-space driver into the kernel - each one is a single batch being dispatched.  Shark gives you the synchronous time spent in each call (that is, how long it took the GL to queue the command and return), and if any call causes a block, we can see it.

Below it the yellow and purple bar is a forest extrusion; the purple are VM-based zero-fill calls; the green icon on the end is a call to the GL to buffer up a VBO - turns out this causes zero fill. At the end of the bar we see a blue icon when the worker thread waits on the message queue for the next bit of work to process.

Apple provides another tool to view thread time: Thread Viewer.

On the right is a thread viewer of X-Plane on an 8-core Mac Pro.  The bottom solid green line is the main thread, running all the time.  The cascading green dots are asynchronous scenery generation (in this case forests are on maximum and the sim is running in a debug mode, which consumes more CPU).  As far as I can tell, they cascade because X-Plane has built 8 worker threads, and each one gets the next work task in turn.  Up top are threads built by the sound manager to feed the audio hardware.

Wednesday, April 30, 2008

Triple-Boot Mac Part 2: MBRs Are Still Hell

A while ago I blogged about the setup process to get OS X 10.4, Windows XP, and Ubuntu 6.06 onto my MacBook Pro. I tried the same trick today on a Mac Pro running OS X 10.5, Windows Vista, and Ubuntu 8.04. To further complicate things, I created a true swap partition.

Now in the old way, Windows had to be last on the disk...fortunately Vista is a little bit more savvy about partitioning (although it still uses MBR and not GPT partitions). So the order of operations was:
  1. Install rEFIt onto the Mac.
  2. Use diskutil to partition the main drive, cutting off three new Fat-32 partitions, size 30G, 30G, and 10G, which will become Vista, Linux and a swap partition.
  3. Install ext2FSX.
  4. Reformat the second Fat-32 partition as ext2 format. (Once ext2FSX is installed, Disk Utility will let you do this.)
  5. Install Ubuntu, using manual partitioning, and using the last Fat-32 partition as the swap partition.
  6. On reboot, use rEFIt's partition tool to fix the MBR partition table, which the Ubuntu installer will trash.
  7. Install Vista onto the middle partition, which hasn't been used. Vista will want you to reformat the Fat-32 partition as NTFS, which is fine. One tricky part: since rEFIt is installed, when Vista reboots you have to pick the volume that the Vista installer thinks it is rebooting to. As far as I can tell, this is always the volume Vista is being installed on.
  8. Reboot with the Ubuntu Live CD, and run a grub shell and use "setup" to put a boot sector back on the Linux partition. I don't know why it was missing - possibly an error in setting up Ubuntu the first time.
Some notes on the sketchier steps:
  • Why ext2 on Mac? For some reason the 10.5 version of diskutil won't format Linux volumes. The problem is that when the Ubuntu installer reformats the FAT-32 partition as ext2, OS X doesn't know about it. The next time you boot OS X, your Linux partition is mounted as a FAT-32 volume, which completely trashes it. It took me quite a few tries to figure this out, and I got myself into a state where I had to totally rebuild the startup drive from a full format and reinstall OS X itself. So converting the format to ext2 is a hack to keep the partition from getting smashed by OS X.
  • rEFIt is great - given a valid GPT partition table and a screwed up MBR table, it'll usually do what you want. It will prioritize the first four partitions, hence the Linux swap is the last partition, not Vista. (BTW Vista's partition tool sees the swap partition as "unused space" - so do not repartition in Vista.)
  • The goo to rebuild grub is more or less:
sudo grub
root (hd0,3)
setup (hd0,3)


Grub will complain a bit, but it will fix the MBR. Note that hd0,3 means the 4th partition on the first hard drive, so your mileage may vary. Tab completion in grub is a great way to scan the local file systems, e.g. type root (hd to list drives and root (hd0, to list partitions.

Friday, April 25, 2008

What is OOP Good For?

An interesting article here got me thinking about design and the development process. First of all, I agree completely with their 3 lies, particularly the second and third ones. To riff on it a little bit:

Goals and Non-Goals

Code serves two purposes:
  1. It has to do what you want (and this usually means: the features you want, the performance you want, no bugs).
  2. It has to not drive you insane.
This second point is of key importance: code design paradigms need to consider maintainability. Code always lives longer than you think it will - you're going to be stuck with your work, so set up your code to be usable. I would say you can't achieve (1), or at least can't achieve it cost-effectively, without (2).

However, what's conspicuous by absence is: the purpose of code is not to determine your data model! And in this regard OOP has done us a bit of a disservice in how we think about design.

OOP Myths

The problem with an OOP design is that it confuses code structure with data structure by providing a set of idioms and language constructs that bind the two together in very specific ways.

When I was in school, OOP was mis-advertised as providing three really great things:
  1. Encapsulation (that is, keeping the guts of your code from leaking all over the place).
  2. Polymorphism (that is, giving you language constructs to define multiple implementations of abstract behaviors).
  3. Code reuse (via inheritance, that is, you should be able to sub-class to utilize existing code).
Turns out that's not really true. Item 1 (encapsulation) is where 99.9% of the value of OOP comes from. The number one problem with big old code bases is that everything fondles everything else, and it's impossible to work in an environment like that. To the extent that OOP encapsulation encourages programmers not to write crappy code, that's a good thing, although OOP doesn't have a monopoly on this.

Polymorphism is a neat trick, but just not that important. I think we have all of one polymorphic class in all of X-Plane (plus one more in the installer). That's in close to half a million lines of code. Again, OOP doesn't have the monopoly - we get most of our dynamic behavior from function behaviors, not virtual function tables.

And code reuse via inheritance is so far off base that I'd call it a damned lie. Inheriting implementation usually results in chaos and is a bad idea, so I never believed this one.

But the idea that code design and data design serve different masters brings home exactly why this is a poor idea. Basically you are creating a very particular relationship among your data for the purpose of organizing your code! Bad idea! Data design should be driven by program needs, not code organization needs.

Design Method

When I work on X-Plane features, I think in terms of a data structure and an algorithm that go together to serve my purpose. Typically they are inseparable; it is necessary to organize the data in a certain way to run an algorithm with the performance characteristics we need. I would say you can't separate the two; is a hash table the structure, the algorithms for insert, search, etc., or both? I'd say both, because either one on its own is useless.

The actual "OOP" level code design is only determined after I have a clear picture of the data and algorithm design. Once we've decided that we're going to use a quad-tree for our scene graph (again, that's an algorithm and a data structure), then I go looking for language-specific idioms that will do this well.

(In this case we do use objects, but they aren't polymorphic, and in most cases they're just glorified structures.)

Refactor Early, Refactor Often

Finally, I always start a new feature in X-Plane with a clean work-surface by refactoring all of the code I will be touching as heavily as I can. You wouldn't know that the module in question wasn't version 1.0 (and this is before coding the new feature).

The advantage of this is that it cuts new-feature development time way down -- it's as if the app was custom coded to make the feature trivial. Most of the real work and development time is spent in the refactoring.

One reason to favor refactoring over new features for where to spend development time is that when refactoring, you can make a number of small changes and regression-test repeatedly. I find this goes a lot smoother than changing a huge amount of code and then discovering that multiple bugs cause nothing to work. With the small changes and regression tests, bugs are caught a few at a time, making them easy to stomp.

By comparison, a new feature may be complex enough that it has to be coded fairly completely before it can be tested in useful ways. (The Rapid Development folks would yell at me for not having adequate test code.) Regardless of your testing diligence, you can't deny that when you start a new feature, there are already more ways to test the old one than the new one.

Thursday, April 24, 2008

The Über-Queue

In redesigning the background processing code for X-Plane, I am revisiting message queues.

Requirements on Queue Design
  • Anyone can be a worker thread. To be a worker thread, you simply pull off a task and execute it. (But see barriers below.) The important thing here is that we can have more than one worker thread for one giant queue, so that dispatch models CPU cores rather than categories of work.
  • Anyone can queue work - no requirement to be the main thread (just about any message queue design does this).
Problems Left for Client Code
  • Client code must not delete objects that are required by pending work tasks (scheduled or running). Doing so would be, um, bad.
  • Client code cannot halt running work tasks - doing this would introduce a huge amount of complexity for very little value.
  • Client code must use its own mechanism for getting "receipts" from finished tasks.
To this last point, typically while we want to schedule everybody in one big queue and have all hardware work on tasks, we want to dispatch our results to disparate locations in the app in a very client-code-specific way. In our case, we'll typically put confirmation code in the specific task objects, such as sending a message to our parent subsystem saying we're finished.

Queue Editing

While it's not strictly necessary, we provide an atomic editing operation on our message queue - basically a client can go in and reorder pending tasks. There are a few important uses for this:
  • It allows for prioritized queueing - I can queue a work task and then edit the queue to put my task first.
  • I can find "stale" work objects and remove them (replacing them with no-ops). Normally if I need to delete an object that is referenced by a pending task, I need to wait for the task (and have some way of knowing that the task has completed, specific to the task). But if I can clean the queue first I can remove unnecessary scheduled tasks and prevent work on soon-to-be-dead objects.
  • Being able to place a task in a specific place is useful for barriers.
Barriers

Barriers are actually implemented as clients on top of the scheduling system - their only requirement is to know the exact number of worker threads. (This is the one bit of generality it costs the worker thread pool.)

A barrier is implemented as a pair of message queues (or even semaphores, since all we care about is counting) and one work task for each thread.

We use the barrier to halt thread processing by inserting one barrier task per worker into the queue; each task signals back to us and holds up its worker. We then wait for a receipt from each of the work tasks (four, say, on a four-worker pool), indicating that all of the workers are now waiting on us. At this point, we are running and background processing is halted.

To resume processing, we send messages back to the worker threads and wait again. They acknowledge and then finish. They send an acknowledgment back to us because we may want to deallocate the resources used in the barrier, and we cannot do so until we know that all worker threads have woken up and finished their code.

We can use barriers when we need a true halt of processing; we can also use them as markers in the stream to indicate that we are done with a range of tasks.

If we want to wait for only one task to be done, we're better off having it signal us directly. The power of barriers is that, because they halt all processing, we can be sure that all tasks before us have finished. But the real purpose of barriers is to halt the processing of tasks that are behind us.