In a past post I described CGAL as having no rounding errors. It does this by using number types of variable size (using dynamically allocated memory per number!) so that it never runs out of digits. (It also maintains the numerator and denominator of fractions separately to avoid problems with repeating decimals.)
The advantage of this is that geometric algorithms that rely on precise calculations never go haywire due to rounding errors. For example, when using fixed-precision math (e.g. IEEE floats) the intersection of two near-parallel lines will be calculated inaccurately - sometimes with the intersection showing up miles from the original lines. CGAL always has more precision, so it avoids this problem.
But there is one down-side: when you perform a series of intersections, the result is exact numbers whose mantissas (the number of actual digits) have grown very long. And CGAL won't blink about making them even longer as you do more calculations.
Instead CGAL will become insanely slow.
I hit this case the other day. The first piece of processing I do is to combine a whole pile of vector data from OSM into one integrated map. While OSM is not particularly high precision (from a bits standpoint), the resulting intersection points are calculated "perfectly", sometimes with very large mantissas.
I then wrote a piece of code to take a city block from that OSM map and perform some calculations to find the sidewalks. The problem: the four corners of the city block were already very long numbers, since they were the result of a CGAL calculation. Thus a long calculation on top of a long calculation becomes very slow.
The original algorithm took about 36 minutes for a fully optimized build to find all sidewalks in San Diego. That is way too slow, and unusable for our project.
I then put a rounding stage in: for each corner of the block, I convert it to a regular 64-bit IEEE float and then back to CGAL, throwing out any "extra" precision that CGAL was saving. Note that the 64-bit float already gives me better than 1 millimeter precision, which is more than overkill for a road. Run on the "simplified" data, the algorithm finished in 67 seconds.
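For reference, here is a minimal sketch of what such a rounding stage might look like, assuming CGAL's exact-constructions kernel and 2-d points (the real code uses whatever kernel and point types the project is actually built on):

    #include <CGAL/Exact_predicates_exact_constructions_kernel.h>

    typedef CGAL::Exact_predicates_exact_constructions_kernel  Kernel;
    typedef Kernel::Point_2                                     Point_2;

    // Squeeze each exact coordinate through a 64-bit double and rebuild the
    // point. Any precision beyond what a double can hold is thrown away, so
    // the mantissas stop growing.
    static Point_2 round_corner(const Point_2& p)
    {
        double x = CGAL::to_double(p.x());
        double y = CGAL::to_double(p.y());
        return Point_2(x, y);
    }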
Now there is one danger: if, due to mismatched road locations in OSM or conflicting edits, some of the "blocks" were really tiny (less than 1 mm) CGAL would have correctly built that block using infinite precision, and my "rounding" would have incorrectly reshaped those blocks, perhaps turning them inside out or in some other way damaging them.
So a necessary step to productizing this 'resolution reduction' is to do a sanity check on each resulting block. Fortunately most of the time if the block contains too-small-to-use data, we don't need the data in the first place.
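A sanity check along those lines might look something like this - again a sketch, assuming the block is stored as a CGAL Polygon_2 and using a made-up minimum-area threshold:

    #include <CGAL/Exact_predicates_exact_constructions_kernel.h>
    #include <CGAL/Polygon_2.h>

    typedef CGAL::Exact_predicates_exact_constructions_kernel  Kernel;
    typedef CGAL::Polygon_2<Kernel>                             Polygon_2;

    // Reject any block that the rounding step may have damaged: rounding can
    // make a tiny block self-intersecting, inverted, or degenerate.
    static bool block_survived_rounding(const Polygon_2& block)
    {
        const double min_area = 1.0;        // made-up threshold; anything smaller is noise for sidewalks
        if (!block.is_simple())
            return false;                   // edges now cross each other
        if (block.orientation() != CGAL::COUNTERCLOCKWISE)
            return false;                   // got turned inside out
        if (CGAL::to_double(block.area()) < min_area)
            return false;                   // too small to be a real block
        return true;
    }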
Friday, April 23, 2010
Wednesday, April 21, 2010
Constitutional Opposition
One part of this post by Daring Fireball on the iPhone SDK licensing agreement made me chuckle:
    If you are constitutionally opposed to developing for a platform where you’re expected to follow the advice of the platform vendor, the iPhone OS is not the platform for you. It never was. It never will be.

It inspired me to come up with a new quotable:
    If you are constitutionally opposed to developing for a platform where you’re expected to follow the advice of the platform vendor, you should not be a computer programmer.

See also basically every post by Raymond Chen: "just because, in Win98SE2, you could call SomeRandomWin32API with a combination of NULL, -1, and Bill Gates's IQ and get an undocumented behavior that violates all of Microsoft's guidelines for application development doesn't mean it will continue to work in Windows 7."
Thank You Jeeves, That Will Be All
The other day I went in to discover why a new piece of scenery code had mysteriously stopped working. Eventually I came to this:
    (p,path.size()/2,def,degree,inExtrudeFunc,
        inObjectFunc,inChecker,ag_mode_draped_obj);

Ah! Now it all makes sense. The code should have read:

    AG_extrude_string(p,path.size()/2,def,degree,inExtrudeFunc,
        inObjectFunc,inChecker,ag_mode_draped_obj);

After having done a global search, clearly I had hit the space bar by accident, nuking my function call. The charming thing is that C++ doesn't question why I have a giant list of parenthetical "stuff"; it just blissfully compiles it into an expression that does...well, pretty much nothing.

Some of my other favorite C++ isms:

    case a: do_it(); break;
    b: do_x(); break;           // no case, not illegal - now "b" is a label!
    defaultl: do_more(); break; // typo in default? That's a label too!

Of course we are all familiar with the fun that emerges from swapping = and ==. And having a stray semi-colon never hurt anything.
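For completeness, here are those two classics in a form the compiler will cheerfully accept (a contrived example, of course):

    #include <stdio.h>

    int main()
    {
        int x = 3;

        if (x = 0)                        // meant ==; this assigns 0 to x, so the test is always false
            printf("never prints\n");

        if (x == 0);                      // stray semi-colon: the "then" branch is an empty statement...
            printf("always prints\n");    // ...so this line runs unconditionally

        return 0;
    }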
Propsman had an apt characterization: C++ is like an overly polite butler. "A...hamburger on the rocks, Sir? Certainly, Sir, I'll bring you one directly..."
Tuesday, March 16, 2010
Santa to Ben: You're An Idiot
A few months ago I posted a request (to Santa) for parallel command dispatch. The idea is simple: if I am going to render several CSM shadow map levels and the scene graph contained in each does not overlap, then each one is both (1) independent in render target and (2) independent in the actual work being done. Because the stuff being rendered to each shadow map is different, using geometry shaders and multi-layer FBOs doesn't help.* My idea was: well I have 8 cores on the CPU - if the GPU could slurp down and run 8 streams, it'd be like having 8 independent GLs running at once and I'd get through my shadow prep 8x as fast.
I was talking with a CUDA developer and finally I got a clue. The question at hand was whether CUDA actually runs parallel kernels. Her comment was that while you can queue multiple kernels asynchronously, the goal of such a technique is to keep the GPU busy - that is, to keep the GPU from going idle between batches of kernel processing. The technique of multiple kernels isn't necessary to keep the GPU fully busy, because even with hundreds of shader units, the kernel is going to run over thousands or tens of thousands of data points. That is, CUDA is intended for wildly parallel processing, so the entire swarm of "cores" (or "shaders"?) is still smaller than the number of units of work in a batch.
If you submit a tiny batch (only 50 items to work over) there's a much bigger problem than keeping the GPU hardware busy - the overhead of talking to the GPU at all is going to be worse than the benefit of using the GPU. For small numbers of items, the CPU is a better bet - it has better locality to the rest of your program!
So I thought about that, then turned around to OpenGL and promptly went "man am I an idiot". Consider a really trivial case: we're preparing an environment map, it's small (256 x 256) and the shaders have been radically reduced in complexity because the environment map is going to be only indirectly shown to the user.
That's still at least 65,536 pixels to get worked over (assuming we don't have over-draw, which we do). Even on our insane 500-shader modern day cards, the number of shaders is still much smaller than the amount of fill we have to do. The entire card will be busy - just for a very short time. (In other words, graphics are still embarrassingly parallel.)
So, at least on a one GPU card, there's really no need for parallel dispatch - serial dispatch will still keep the hardware busy.
So...parallel command dispatch? Um...never mind.
This does raise the question (which I have not been able to answer with experimentation): if I use multiple contexts to queue up multiple command queues to the GPU using multiple cores (thus "threading the driver myself"), will I get faster command-buffer fill and thus help keep the card busy? This assumes that the card is going idle when it performs trivially simple batches that require a fair amount of setup.
To be determined: is the cost of a batch in driver overhead (time spent deciding whether we need to change the card configuration) or real overhead (e.g. we have to switch programs and the GPU isn't that fast at it)? It can be very hard to tell from an app standpoint where the real cost of a batch lives.
Thanks to those who haunt the OpenGL forums for smacking me around^H^H^H^H^H^H^H^H^Hsetting me straight re: parallel dispatch.
* geometry shaders and multilayer FBO help, in theory, when the batches and geometry for each rendering layer are the same. But for a cube map if most of the scene is not visible from each cube face, then the work for each cube face is disjoint and we are simply running our scene graph, except now we're going through the slower geometry shader vertex path.
Wednesday, March 10, 2010
The Value Of Granularity
OpenGL is a very leaky abstraction. It promises to draw in 3-d. And it does! But it doesn't say a lot about how long that drawing will take, yet performance is central to GL-based games and apps. Filling in this gap requires transient information about OpenGL and its current dominant implementations, and that information isn't easy to come by - it comes from a mix of insight from other developers, connecting the dots, reading many disparate documents, and direct experimentation. This isn't easy for someone who isn't working full time as an OpenGL developer, so I figure there may be some value to blogging things I have learned the hard way about OpenGL while working on X-Plane.
OpenGL presents new functionality via extensions. (It also presents new functionality via version numbers, but the extensions tend to range ahead of the version numbers because the version number can only be bumped when all required extensions are available.) When building an OpenGL game you need a strategy for coping with different hardware with different capabilities. X-Plane dates back well over a decade, and has been using OpenGL for a while, so the app has had to cope with pretty much every extension being not available at one point or another.
Our overall strategy is to categorize hardware into "buckets". For X-Plane 9 we have 2.5 buckets:
- Pre-shader hardware, running on a fixed function pipeline.
- Modern shader enabled hardware, using shaders whenever possible.
- We have a few shaders that get cased off into a special bucket for the first-gen shader hardware (R300, NV25), since that hardware has some performance and capability limitations.
So here is what has turned out to be surprising: we were basically forced to allow X-Plane to run with a very granular set of extensions for debugging purposes. An example will illustrate.
Using the buckets strategy you might say: "The shader bucket uses GLSL, FBOs, and VBOs. Any hardware in that category has all three, so don't write any code that uses GLSL but not FBOs, or GLSL but not VBOs." The idea is to save coding by reducing the combination of all possible OpenGL hardware (we have eight combos of these three extensions) to only two combinations (have them all, don't have them all).
What we found in practice was that being able to run in a semi-useful state without FBOs but with GLSL was immensely useful for in-field debugging. This is not a configuration we'd ever want to really support or use, but at least during the time period that we started using FBOs heavily, the driver support for them was spotty on the configurations we hit in-field. Being able to tell a user to run with --no_fbos was an invaluable differential to demonstrate that a crash or corrupt screen was related specifically to FBOs and not some other part of OpenGL.
As a result, X-Plane 9 can run with any of these "core" extensions in an optional mode: FBOs, GLSL, VBOs (!), PBOs, point sprites, occlusion queries, and threaded OpenGL. That list matches a series of driver problems we ran across pretty directly.
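The machinery behind this doesn't have to be complicated. A sketch of how per-extension opt-outs might be wired up (the flag names and command-line options are illustrative, not X-Plane's actual code):

    #include <string.h>
    #include <GL/gl.h>

    struct gl_caps {
        bool has_fbos;
        bool has_vbos;
        bool has_glsl;
    };

    // Simple substring match on the extension string - good enough for a sketch.
    static bool has_extension(const char * name)
    {
        const char * all = (const char *) glGetString(GL_EXTENSIONS);
        return all != NULL && strstr(all, name) != NULL;
    }

    // Each capability is the AND of "the driver advertises it" and "the user
    // didn't turn it off from the command line" - that second term is what
    // makes the --no_xxx differential diagnosis possible.
    static gl_caps init_caps(bool no_fbos, bool no_vbos, bool no_glsl)
    {
        gl_caps c;
        c.has_fbos = !no_fbos && has_extension("GL_EXT_framebuffer_object");
        c.has_vbos = !no_vbos && has_extension("GL_ARB_vertex_buffer_object");
        c.has_glsl = !no_glsl && has_extension("GL_ARB_shading_language_100");
        return c;
    }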
Maintaining a code base that supports virtually every combination is not sustainable indefinitely, and in fact we've started to "roll up" some of these extensions. For example, X-Plane 9.45 requires a threaded OpenGL driver, whereas X-Plane 9.0 would run without it. We drop support for running without an individual extension when tech support calls indicate that, in the field, the extension is now reliable.
At this point it looks like FBOs, threaded OpenGL, and VBOs are pretty much stable. But I believe that as we move forward into newer, weirder OpenGL extensions, we will need to keep another set of extensions optional on a per-feature basis as we find out the hard way what isn't stable in-field.
Sunday, February 28, 2010
One More On VBOs - glBufferSubData
So if you survived the timing of VBO updates (or rather, my speculations on what is possible with VBO updates), now you're in a position to ask the question: how fast might glBufferSubData be? In particular, developers like myself are often astonished when glBufferSubData does things like block.
In a world before manual synchronizing of VBOs (via the 3.0 buffer management APIs or Apple's buffer range extensions) we can now see why a sub-data update on a streamed VBO might perform quite badly.
The naive code goes something like this:
- Fill half the buffer with buffer sub-data.
- Issue a draw call to that half of the buffer.
- Flip which half of the buffer we are using and go back to step 1.
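In actual GL calls, that loop might look something like this (a sketch; it assumes a GL 1.5-capable header or loader, and the buffer handle, vertex data, and sizes are whatever your renderer already has):

    #include <GL/gl.h>

    // Per-frame streaming, the naive way: write this frame's vertices into one
    // half of a double-sized VBO, then draw from that half.
    void stream_and_draw(GLuint vbo, const float * verts, GLsizeiptr half_size, GLsizei vert_count)
    {
        static int which = 0;
        GLintptr offset = which * half_size;

        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        // This is where we stall: the driver may wait for the previous draw
        // (which reads the *other* half) before letting us touch the buffer.
        glBufferSubData(GL_ARRAY_BUFFER, offset, half_size, verts);

        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *) offset);
        glDrawArrays(GL_TRIANGLES, 0, vert_count);

        which = 1 - which;
    }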
This implementation is going to perform terribly. The sub-data call is going to block until the previous draw call has completed, even though they use opposite halves of the buffer, and we'll lose all of our concurrency. Let's see if we can understand why.
If we go to respecify a VBO in AGP memory using glBufferSubData while that VBO is in progress, glBufferSubData must block; it can't rewrite the buffer until the last draw finishes because we would see the new vertices, not the old, or maybe half and half. In order for the "fill" to complete, the driver would have to be able to determine that the pending draws and the new fill are completely disjoint.
There are two reasons why the driver might not be able to figure this out:
- You've drawn using glDrawElements, and thus the actual part of the vertex VBO you draw from is determined by the index table. The cost of figuring out the "extent" of this draw is to process all of the indices. The cure is worse than the disease. Any sane driver is going to simply assume that any part of the VBO could be used.
- Let's assume you use glDrawRangeElements to tell the driver that you're really only going to use half the VBO. Even then, the structure to mark "locked" regions would be a complex one - a series of draws over overlapping regions would require a complex data structure. For this one special case, you're asking the drivers to replace a simple time-stamp based lock (e.g. this VBO is locked until this many commands have executed) with a dynamic range marking structure. If I were a driver writer I'd say "let's keep it simple and not eat this cost on all VBOs."
Can we do anything about this? Besides falling back to an "orphaned" approach where we get a fresh buffer each time, our alternative is to use the more exact APIs from ARB_map_buffer_range or APPLE_flush_buffer_range. With these APIs we can map only the part of the VBO we know is not in use, with the unsynchronized bit set to avoid blocking because the other half is pending. We can use flush explicit to then flush only the areas we modified. (With the 3.0 APIs we can also use the discard range option to simply say "we are rewriting what we map".)
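A sketch of the explicit-range version, using the ARB_map_buffer_range entry points (error checking omitted):

    #include <string.h>
    #include <GL/gl.h>

    // Map only the half we believe the GPU is done with, skip the driver's own
    // synchronization, and flush exactly what we wrote. Keeping that belief
    // true is now entirely our problem.
    void fill_half_unsynchronized(GLuint vbo, const void * verts, GLsizeiptr half_size, int which_half)
    {
        GLintptr offset = which_half * half_size;

        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        void * dst = glMapBufferRange(GL_ARRAY_BUFFER, offset, half_size,
                        GL_MAP_WRITE_BIT |
                        GL_MAP_INVALIDATE_RANGE_BIT |   // "we are rewriting what we map"
                        GL_MAP_UNSYNCHRONIZED_BIT |     // don't block on the other half's pending draws
                        GL_MAP_FLUSH_EXPLICIT_BIT);     // we'll say exactly which bytes changed

        memcpy(dst, verts, half_size);

        glFlushMappedBufferRange(GL_ARRAY_BUFFER, 0, half_size);   // offset is relative to the mapped range
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }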
Of course, this technique isn't without peril - all synchronization is up to the client. The main danger is an over-run: your app is so fast that it needs to modify a range that the GL isn't done with - we made it all the way around our ring buffer. Probably the safest way to cope with this is to put explicit fences in place to wait until the last dependent draw call that we issued is finished.
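One way to build that over-run guard is with ARB_sync fences, roughly like so (a sketch; older hardware would use NV or Apple fence extensions instead):

    #include <GL/gl.h>

    // Drop a fence right after the last draw call that reads a given half of
    // the buffer...
    GLsync fence_after_draw(void)
    {
        return glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    }

    // ...and wait on it before rewriting that half. This only blocks if we
    // have lapped the GPU and it is still reading those vertices.
    void wait_before_refill(GLsync fence)
    {
        // Wait up to one second (in nanoseconds); in practice this returns
        // almost immediately unless we really have run too far ahead.
        glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000ull);
        glDeleteSync(fence);
    }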
Double-Buffering Part 2 - Why AGP Might Be Your Friend
In my previous post I suggested that to get high VBO vertex performance in OpenGL, it's important to decouple pushing the next set of vertices from the GPU processing the existing ones. A naively written program will block when sending the next set of vertices until the last one goes down the pipe, but if we're clever and either orphan the buffer or use the right flags, we can avoid the block.
(My understanding is that orphaning actually gets you a second buffer, in the case where you want to double the entire buffer. With manual synchronization we can simply be very careful and use half the buffer each frame. Very careful.)
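For reference, orphaning in code is just a re-specification with NULL data before the fill (a sketch):

    #include <GL/gl.h>

    // Re-specifying the buffer with NULL data tells the driver "I no longer
    // care about the old contents", so it can hand us fresh storage instead of
    // making us wait for draws that still read the old storage.
    void orphan_and_fill(GLuint vbo, const void * verts, GLsizeiptr size)
    {
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW);   // orphan the old storage
        glBufferSubData(GL_ARRAY_BUFFER, 0, size, verts);            // fill the new storage
    }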
Now I'm normally a big fan of geometry in VRAM because it is, to put it the Boston way, "wicked fast". And perhaps it's my multimedia background popping up, but to me a nice GPU-driven DMA seems like the best way to get data to the card. So I've been trying to wrap my head around the question: why not double-buffer into VRAM? This analysis is going to get highly speculative - the true answer, I think, is "the devil is in the details, and the details are in the driver" - but at least we'll see that the issue is very complex: double-buffering into VRAM has a lot of things that could go wrong, so we should not be surprised if, when we tell OpenGL that we intend to stream our data, it gives us AGP memory instead.*
Before we look at the timing properties of an application using AGP memory or VRAM, let's consider how modern OpenGL implementations work: they "run behind". By this I mean: you ask OpenGL to draw something, and some time later OpenGL actually gets around to doing it. How much behind? Quite possibly a lot. The card can run behind at least an entire frame, depending on implementation, maybe two. You can keep telling the GPU to do more stuff until:
- You hit some implementation defined limit (e.g. you get 2 full frames ahead and the GPU says "enough!"). Your app blocks in the swap-backbuffer windowing system call.
- You run out of memory to build up that outstanding "todo" list. (Your app blocks inside the GL driver waiting for command buffers - the memory used to build the todo list.)
- You ask OpenGL for the result of something it hasn't done yet. (E.g. you try to read an occlusion query that hasn't finished and block in the "get" call.)
- You ask to take a lock on a resource that is still pending for draw. (E.g. you do a glMapBuffer on a non-orphaned VBO with outstanding draws, and you haven't disabled sync with one of the previously mentioned extensions.)
Having OpenGL "run behind" is a good thing for your application's performance. You can think of your application and the GPU as a reader-writer problem. In multimedia, our top concern would be underruns - if we don't "feed the beast" enough audio by a deadline, the user hears the audio stop and calls tech support to complain that their expensive ProTools rig is a piece of junk. With an OpenGL app, underruns (the GPU got bored) and overruns (the app can't submit more data) aren't fatal, but they do mean that one of your two resources (GPU and CPU) are not being fully used. The longer the length of the FIFO (that is, the more OpenGL can run behind without an overrun) the more flexibility we have to have the speed of the CPU (requesting commands) and the GPU (running the commands) be mismatched for short periods of time.
An example: the first thing you do is draw a planet - it's one VBO, the app can issue the command in just one call. Very fast! But the planet has an expensive shader, uses a ton of texture memory, and fills the entire screen. That command is going to take a little time for the GPU to finish. The GPU is now "behind." Next you go to draw the houses. The houses sit in a data structure that has to be traversed to figure out which houses are actually in view. This takes some CPU time, and thus it takes a while to push those commands to the GPU. If the GPU is still working on the planet, then by the time the GPU finishes the planet, the draw-house commands are ready, and the GPU moves seamlessly from one task to the other without ever going idle.
So we know we want the GPU to be able to run behind and we don't want to wait for it to be done. How well does this work with the previous post's double-buffer scheme? It works pretty well. Each draw has two parts: a "fill" operation done on the CPU (map orphaned buffer, write into AGP memory, unmap) and a later "draw" operation on the GPU. Each one requires a lock on the buffer actually being used. If we can have two buffers underneath our VBO (some implementations may allow more - I don't know) then:
- The fill operation on frame 3 will wait for the draw operation on frame 1.
- The fill operation on frame 4 will wait for the draw operation on frame 2.
- The draw operation on frame N always waits for the fill operation (of course).
If the buffer is going to be drawn from VRAM, things get trickier. We now have three steps:
- "fill" the system RAM copy. Fill 2 waits on DMA 1.
- "DMA" the copy from system RAM to VRAM. DMA 2 waits on fill 2 and draw 1.
- "draw" the copy from VRAM. Draw 1 waits on DMA 1.
Consider the case where the DMA happens right after we finish filling the buffer. In this case, the DMA is going to block on the last draw not completing - we can't specify frame 2 until frame 1 draw is mostly done. That's bad.
What about the case where the DMA happens really late, right before the draw really happens. Filling buffer 2 is going to block taking a lock until the previous frame 1 DMA completes. That's bad too!
I believe that there is a timing that isn't as bad as these cases though: if the OpenGL driver can schedule the DMA as early as possible once the card is done with the last draw, the DMA ends up with timing somewhere in between these two cases, moving around depending on the actual relationship between GPU and CPU speed.
At a minimum I'd summarize the problem like this: since the DMA requires both of our buffers (VRAM and system) to be available at the same time, the DMA has to be timed just right to keep from blocking the CPU. By comparison, a double-buffered AGP strategy simply requires locking the buffers.
To complete this very drawn-out discussion: why would we even want to stream out of VRAM? As was correctly pointed out on the OpenGL list, this strategy requires an extra copy of the data - our app writes it, the DMA engine copies it, then the GPU reads it. (With AGP, the GPU reads what we write.) The most compelling case that I could think of, the one that got me thinking about this, is the case where the streaming ratio isn't 1:1. We specify our data per frame, but we make multiple rendering passes per frame. Thus we draw our VBO perhaps 2 or 3 times for each rewrite of the vertices, and we'd like to send it over the bus only once. A number of common algorithms (environment mapping, shadow mapping, early Z-fill) all run over the scene graph multiple times, often with the assumption that geometry is cheap (which mostly it is).
But this whole post has been pretty much entirely speculative. All we can do is clearly signal our intentions to the driver (are we a static, stream, or dynamic draw VBO) and orphan our buffers and hope the driver can find a way to keep giving us buffers rapidly without blocking, while getting our geometry up as fast as possible.
* We might want to assume this and then be careful about how we write our buffer-fill code so that it is efficient in uncached write-combined memory: we want to fill the buffer linearly in big writes and not read or muck around with it.