Tuesday, July 07, 2009

GeoTIFF and Off-By-One

I fear I have stumbled into GeoTIFF off-by-one hell. There is surprisingly little written about this, considering that (1) GeoTIFF is somewhat widely used and (2) the whole point of GeoTIFF is that the file tells you where the data goes. So...here is a summary of what I've stumbled into.

GeoTIFF has a flag to indicate whether pixels are "pixel is area" or "pixel is point". To summarize:
  • Pixel is area means that each pixel is a rectangle, with infinitely thin coordinates bounding all four sides. The sample data represents a summary of the entire area, or maybe a sample at the center of the area.
  • Pixel is point means that each pixel is the infinitely tiny intersection of two infinitely thin grid-points and represents the data as sampled at that point.
Who cares? Well, it turns out it matters. Imagine we have "3 second" data, which means that there are 1200 samples per degree, each one approximately 90 meters apart. (This is geographic projection, so the postings vary with latitude - I mention 90 meters only for the "feel" of the data.)

Let's consider what we have to do if we want data that covers exactly one degree.

If we are using area pixels, that one degree needs 1200 samples, each one 1/1200th of a degree wide. The left edge of the left most sample would be on a longitude line and the right edge of the right-most sample would be too. The next tile should have no data duplication from this tile.

If we are using point pixels, that one degree needs 1201 samples, each 1/1200th of a degree apart, but with no width on their own. The longitude line of the left edge has samples exactly on it, as does the right edge. And in fact, the right edge of our tile is a duplicate of the left-edge of the previous tile.

We could choose to exclude two of the four edges to in the pixel-center case to avoid duplication, but when we're tiling with pixel-center data, it is sort of nice to have the entire tile.

SRTM

The SRTM is pixel-center, at least in the raw data that JPL provides; we can only assume that every derived product should be that way too.

When you download a seamless SRTM tile from the USGS you get 1201 pixels, tagged as area pixels, and a bounding box that goes 1.5 arc-seconds outside the bounds you might expect. This is a little bit strange (IMO) but more or less correct.

The CGIAR SRTMs in GeoTIFF appear to be shifted by half a pixel: while the ARC ASCII files come in 6001 blocks that are clearly point-pixel (and this is what we would expect from a SRTM-derived product) the GeoTIFF version provides 6000 postings, but is listed as area-pixel with bounds on tile boundaries. Visual inspection indicates that the south and east rows have been dropped to avoid duplication; this would imply that the bounding box should be inset on the south and east edges by 1.5 arc seconds (half a pixel) and outside on the other edges.

NED

NED appears to truly be area aligned (unlike SRTM); every form I have seen has been 3600 pixels centered on a lat-lon tile. In this case area pixels is an appropriate way to tag the data set.

Friday, July 03, 2009

CAS And Reference Counting Don't Mix

It took me a bit of head-scratching to understand this, but...

Atomic Reference Counting

You can use atomic operations to reference count objects. For example, if you look at the GNU C++ string implemention, you'll see that a string object is a pointer to a shared "real string". When a string is copied, the reference count is incremented. When the reference count drops to 0, the string that dropped the last reference deletes the buffer. ("Last one out, turn off the lights.")

The string doesn't need locks because it uses atomic operations to maintain the reference count. In particular, there is no race between thread A dropping the count to zero and thread B cloning the buffer again. If thread B is in a piece of code that can be cloning the buffer, thread B must be doing it via an object that has a reference to the buffer...thus thread A's decrement could not drop to 0 - at worst it could drop to 1 (B having the remaining count).

Compare and Swap (CAS) For Structures

CAS can be a useful way to modify a structure. A typical CAS operation will:
  • Make a local value of a "slot" to be modified.
  • Calculate a new value for the slot.
  • Attempt to CAS the new value in for the old.
  • If the slot contains a value other than the old value (the CAS fails) we know some other thread got in there instead. We loop around and retry.
For example, we can push a new object onto a list by first preparing our new head node and then CASing. If the head pointer changed, we know that our newly prepared node is not pointing to the old head and we need to retry.

CAS + Reference Count = Race Condition

One way to make a general purpose data structure lock-free and thread-safe is to have the structure accessible by pointer. The modify operation uses CAS - that is, the modifier clones the structure, makes the modifications, and attempts to swap the structure back in by pointer (using CAS). If the swap fails, the writer thread has to retry.

In order for this to work, we need to be sure that the old structure is not destroyed while anyone is reading it.

My immediate thought was: no problem - we can use reference counts. We will increase the reference count when using that old structure and not throw it out until everyone is done.

Such an algorithm would go something like this:
to read:
retain our struct
read it
release our struct

to modify:
retain our struct
copy it
modify the copy
attempt to CAS the structure back
if the CAS succeeds, release the old struct twice.
if the CAS fails, release the old struct once and the new struct once and retry.
(Where did I get those release counts? Well, the current struct is retained once as "the master copy". We retain it again to assure that it isn't deleted while we copy it. So if we successfully modify, we need to release it twice - once for us and once because it isn't the master copy any more. If the CAS fails, we release only our "don't delete while we copy" reference, and we release our new copy to deallocate it.)

Why This Fails

It turns out that you can't implement this structure using typical atomic operations (CAS and atomic increment/decrement). The problem is in the read. How exactly do we increase the retain count?

In GNU's string they do something like this:
atomic_increment(&my_buffer->reference_count);
And that works great, because my_buffer is immutable. But in our operation, the pointer to the struct is subject to atomic operations. And yet we have to dereference it to get to the reference count.

The code, when broken down, would look something like this:
  1. dereference my_buffer to find the effective address of the reference count.
  2. atomically modify that reference count at its address.
The problem is that between step 1 and 2 (these are not atomic with regards to each other) someone might first CAS the pointer to the shared object (and it won't fail since we haven't modified that pointer) and then decrease the reference count of the shared object to zero, deleting it.

Now for step 2 the address we have computed is already bogus! Doh!

This can be worked around if an architecture supports very strong conditional operations based on memory modification. In other words, if we can execute the atomic increment conditional on an unmodified pointer to the buffer, we can correctly "bounce out and retry" if someone switches the underlying object before we get our reference count in.

Of course, most of the machines in my office don't have that kind of hardware.

Hazard Pointers - Why They Work

There is an alternate design pattern that works around this: hazard pointers. The basic idea (to heavily oversimplify things) is this: there is a finite amount of memory (where each slot is written to by exactly one thread to avoid having to lock or control this memory) where a thread can declare "hey, I am using this structure".

Deleting structures don't delete a memory block if anyone is using it, which they determine by scanning the block of hazard pointers.

Why do hazard pointers work? They work because, unlike our reference counts, we don't have to first dereference a pointer to "protect" it (with the reference count). Since the hazard pointer is the pointer to the block, we can protect first, dereference second, something like this:
set our hazard pointer to the block we want to use
CAS the ptr to the block with itself and its old value to detect that it hasn't changed already.
If it has, loop back and try again.
If it has not changed, we are now safe to use it, because our hazard pointer is established.
The clever aspect of hazard pointers is that, by providing per-thread storage, the total number of places to check for hazard pointers can be relatively small, but no dereferencing is needed.

Thursday, July 02, 2009

Going Lock-Free - Shared Resources

X-Plane uses a reference counting scheme for resources like textures and models. Each time some scenery element needs a texture, it asks a central manager to grab the texture. This might result in a new object being created, or it might result in an increased reference count to an existing object.

During the version 9 run it became apparent that a central lock on the "big list of textures" was going to be increasingly problematic - the lock was already showing up in performance profiles under only moderate strain. Clearly a central lock was not going to be an option.

The scheme we use now is based on a central lock that is only needed for resource allocation/destruction - not access. It goes something like this:
  • Resources are always referenced by a pointer, not an index. This means that client code that simply needs to utilize the resource can do so without access to the "central list". Thus resource usage can be lock free.

    (This is the most important case. We only create and destroy a texture once but we might use thousands of textures per frame.)

  • The reference count on an object is atomic, so we don't need to lock the object to add or delete references. From a performance standpoint this is nice but not critical.

  • Resource deletion is explicit. This is the most surprising part: we don't actually release a resource until we explicitly make a "garbage collect" call. This has two effects:

    1. We avoid thrash because we can start re-using a zero-reference object before it is thrown out.
    2. It limits the locking of the big list of objects to one explicit function call.
The result of this is a system that doesn't lock during the main frame loop, under any conditions and doesn't thrash too badly during load.

Threading and Sanity

Threaded code is expensive - it takes longer to write and is harder to debug. Here are some of the things I try to do to keep myself sane while working on threaded code.
  • Only thread when there is a need for threading. Threading is a way to capture CPU performance. I consider threading as a way to meet performance goals, not as a default.

    This also means not trying to thread code that is either not performance critical or isn't burning up enough CPU to be worth it. The less code is threaded, the cheaper the app is to develop, so better to only thread the "big fish".

  • The main thread handles flow control. A number of different graphic and UI APIs require access only from the main thread, so a natural extension of this design (particularly in a game with a sequential render-loop).

    The requirement to run on the main thread simplifies some synchronization issues because certain API calls into an object will be "naturally exclusive" due to their main-thread requirement. A debug macro catches illegal calls to these functions from the wrong thread.

  • Message queues for resource ownership. With this idiom, an object can have only one owning thread - the ownership is transferred via thread-safe message queues. Because the message queue is a synchronizer, no locks are required and no dead-locks can occur.

Wednesday, June 24, 2009

The True Power of NEDMalloc

I have dabbled with using NedMalloc in X-Plane...it is faster than the built-in allocator. But sometimes you have to find the right problem to see scalability issues.

I've been hacking at an architecture change (probably going into X-Plane 940) that allows X-Plane to build 3-d forests on an arbitrary number of processors.

(Currently X-Plane 930 can only do this on one processor, albeit not the processor you use to draw the world and fly the plane.)

Going to KBTV with the highest tree settings on an 8 core Mac Pro we get these numbers for "preload" times (that is, the time to pre-generate all nearby forests, which is basically a memory-intensive operation (since the trees are so simple):
  • 9.30: 1 core, 20 seconds.
  • 9.40, OS allocator: 8 cores, 40 seconds.
  • 9.40, NedMalloc: 8 cores, 2 seconds.
I am not making that up. The 8x or better speed-up isn't surprising, but what is incredible is the difference between the system allocator and NedMalloc with only 8 threads!

Monday, June 01, 2009

Heap Debugging (Memory/Resource Leak) with WinDbg

I recently had to do some heap debugging to solve an issue at work and it was a bit of a pain in the butt because there are several steps that I needed to take to set everything up. Here's what I had to do...

  1. First, you need the Debugging Tools for Windows installed.
  2. Open a command prompt and navigate to the debugging tools folder (usually C:\Program Files\Debugging Tools for Windows (x86).
  3. Type "gflags.exe /i yourApplication.exe +ust" (without the quotes). This command creates a user mode stack trace database that will be used later.
  4. Launch the application in WinDbg.
  5. Recreate the error and then break into the debugger.
  6. Run the command "!heap -s". This will display a summary of all of the current heaps. Find the one that seems to be bloated.
  7. Run the command "!heap -stat -h " on the block that appears to be bloated. This will display statistics on the heap's allocations by allocation block size. My problem was that one resource type was being leaked so there was one block size that was used far more frequently than the others. Find that block size.
  8. Run the command "!heap -flt s ". This will filter the heap to only show heaps of that particular size.
  9. Now choose one of the UserPtr's in the list and run the command "!heap -p -a " on it. This will display information on that particular block. It should also include a stack trace from when that memory was allocated.
  10. Remember to remove the global flags by doing "gflags.exe /i -ust" when you're done.

Monday, April 20, 2009

mod_rewrite for MediaWiki

In moving the X-Plane SDK Wiki to MediaWiki I had to use mod_rewrite. Getting this right took me a few tries.  This is what I ended up with:
# First: any old phpwiki witih a query is rewritten to the
# "clean" mediawiki URL. Note that we do not test for file
# existence - as of this writing the php index.php DOES exist.

RewriteCond %{QUERY_STRING} ^(.+)$
RewriteRule ^xpsdk/phpwiki/(.*)$ /xpsdk/mediawiki/%1? [R=301]

# Also, if we did not rewrite that means we had no query -
# means the default page - map it to the base root.

RewriteRule ^xpsdk/phpwiki/(.*)$ /xpsdk/mediawiki/ [R=301]

# Now we have "clean" wiki URLs, so do not make any more rewrites
# permanent. What we do next is to clean up problems.

# And finally and most importantly: because php is running in a CGI,
# it needs the "ugly" title=$1 form. So...convert the virtual path
# into CGI access.

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^xpsdk/mediawiki/(.+)$ /xpsdk/mediawiki/index.php?title=$1 [L,QSA]
Probably the most useful thing I discovered was: by using R=301 you can force the rewrite to be visible in the URL bar. This isn't always what you want in the long term, but it does let you see the output of mod_rewrite, which is handy.