Thursday, July 23, 2009

Degenerate Shape Files

The shapefile specification is more like a pipe dream than a file format; it is a document that describes what people wish would happen in shapefiles. Examples of things one might wish for:
  • No self-intersecting, non-simple, or generally hosed polygons.
  • Polygon windings are CW for outer boundaries and CCW for holes.
  • Whatever convention a shapefile decides to follow, the same convention is used for the entire file.
Unfortunately shapefiles don't always work that way. I discovered that some of the SWBD tiles violate the winding rule.

Fortunately times are changing: with so much open source GIS going on, often you can find a log of other people's misery and pretty quickly get a handle on the problem.

Of particular interest is the fight between correctness and performance in the trac log. How do we rapidly import our data if we don't know how good it is? Checking data quality is expensive.

Ways To Interpret a Broken Shapefile

There are a few ways to interpret a broken shapefile:
  • We can go strictly by winding order, even when it is wrong, and perform a union on overlapping areas.
  • We can use a "binary" approach to polygons (think of what you get when you rasterize a non-simple polygon) where we switch inside and outside every time we cross an edge.
  • We can simply throw out data that violates a spec requirement.
  • We can assume (for better or for worse) that the first winding is outer, the rest are inner.
  • We can do a mix of this based on heuristics (e.g. what OGR does: if the windings look sane, trust them, otherwise slowly calculate ring nesting).
I have chosen the "binary" approach for the following reasons:
  • It is never dependent on the order of windings in the file or the direction of windings.
  • It works correctly with polygons with holes, polygons with islands in holes, and multipolygons (multiple disjoint outer rings), as long as the polygons are non-overlapping.
  • When polygons do overlap, the results are reasonably acceptable.
  • The implementation is really fast.
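The "binary" rule is just the classic even-odd fill rule, which can be illustrated with a ray-casting point-in-polygon test. This is a generic sketch (not CGAL code): a point is inside if a ray from it crosses an odd number of edges, so winding direction and ring order simply never enter into it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };

// Even-odd rule: toggle "inside" every time a horizontal ray from p
// crosses an edge. Winding direction and ring order are irrelevant.
bool inside_even_odd(const std::vector<Pt>& ring, Pt p) {
    bool inside = false;
    for (std::size_t i = 0, j = ring.size() - 1; i < ring.size(); j = i++) {
        const Pt& a = ring[i];
        const Pt& b = ring[j];
        if ((a.y > p.y) != (b.y > p.y) &&
            p.x < (b.x - a.x) * (p.y - a.y) / (b.y - a.y) + a.x)
            inside = !inside;
    }
    return inside;
}
```

For a self-intersecting "bowtie" polygon this gives exactly the result a rasterizer would: the two lobes are filled, the pinch point flips back to outside.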
To understand that last point, we have to look at how CGAL processes polygons.

A Sweeping Improvement in Performance

CGAL implements sweep-line algorithms. Explaining how they work is beyond the scope of this post, but the main thing to note is that sweep-line processes a big pile of overlapping line segments in O((n + k) log n) time, where n is the number of curves and k is the number of intersections. When bulk-adding or merging a large number of line segments into a planar map, this usually beats the pants off incremental modification algorithms.

(A particularly painful case is building a very large ring in an empty map...each insert has to check for intersections with every other segment in the map with basically no spatial indexing or speed-up. Ouch!)

Of course we can go a lot faster when we know our data doesn't overlap, but go look at the title of the blog post...we're dealing with shape files here! We'll count ourselves lucky if polygons have 3 or more points.

Sweep-line is found in a number of interesting places:
  • Bulk-adding line segments to an empty (or very full) map.
  • Overlaying one map with another.
  • Boolean operations between two sets of polygons.
  • Checking whether a polygon is simple. (This is done with a stripped down sweep algorithm, but time complexity should be about the same.)
This list is interesting because approaches built on these different APIs are all going to run at about the same asymptotic cost. The number of times we invoke these operations, however, might matter a lot.

Adding Up Lots Of Polygons

CGAL's general polygon set provides bulk insertion routines - given a range of polygons (or polygons with holes) you can join/difference them all at once. Looking at the code, it appears to use a divide-and-conquer scheme. I think this should be faster than a series of individual merges (which has to be slow, because by the end we are re-merging the finished data over and over), but frankly I can't tell where the various edges get consolidated (if ever).

Here's the problem: if we want to insert polygons with holes, we need to insert non-degenerate ones. (If we insert polygons, then subtract the holes, the results are not the same!) But...non-degenerate polygons, where do we get those?

Rather than an incremental approach (clean each winding, organize the windings, build sane polygons with holes, then merge them) I observe that we don't need clean input to use the "binary" rule for polygon interiors.

So my new shapefile processor simply inserts every single line segment from the entire file into an empty arrangement at once. This is about as fast as we can get - a single sweep that simultaneously deals with overlapping polygons, degenerate windings, and incorrect winding relationships.

I then do a breadth-first search from the outside of the map to the inside, assigning properties. My arrangement tracks, for each edge, the original polygon that provided it, so each time I "cross" a boundary I can track the change in ownership via a set.

The memory cost is O(F + E) where F is the number of faces in the final imported arrangement and E is the number of polygon edges. The time complexity is O(E log E + K log E + F), where K is the number of intersections our shape file turned out to have. (In practice that F doesn't matter; we're almost guaranteed to have more edges than faces in a real-world situation.)

Friday, July 17, 2009

CGAL: Curves, Faces and Caution

In CGAL, the general polygon set code, which lets you do boolean operations on polygons (e.g. cut this polygon out of that polygon), has an arrangement (a planar map) inside it doing the real work.

You can get access to this arrangement and manipulate it directly, or do other "clever" things.

But one thing you might not want to do is to build the arrangement out of curves, and then attempt to pull the polygons out.

It seems like it should be a good idea...but it's not. The problem is that the "exporter" - the code that gives you back your polygons from the underlying arrangement (after you have processed the arrangement) - assumes that each inserted curve runs in the direction of the contour of the polygon that created it.

If you don't do this (for example, you insert intersecting curves) the exporter will return hosed polygon contours.

Tuesday, July 07, 2009

GeoTIFF and Off-By-One

I fear I have stumbled into GeoTIFF off-by-one hell. There is surprisingly little written about this, considering that (1) GeoTIFF is somewhat widely used and (2) the whole point of GeoTIFF is that the file tells you where the data goes. Here is a summary of what I've stumbled into.

GeoTIFF has a flag to indicate whether pixels are "pixel is area" or "pixel is point". To summarize:
  • Pixel is area means that each pixel is a rectangle, with infinitely thin coordinates bounding all four sides. The sample data represents a summary of the entire area, or maybe a sample at the center of the area.
  • Pixel is point means that each pixel is the infinitely tiny intersection of two infinitely thin grid-points and represents the data as sampled at that point.
Who cares? Well, it turns out it matters. Imagine we have "3 second" data, which means that there are 1200 samples per degree, each one approximately 90 meters apart. (This is a geographic projection, so the posting spacing varies with latitude - I mention 90 meters only to give a feel for the data.)

Let's consider what we have to do if we want data that covers exactly one degree.

If we are using area pixels, that one degree needs 1200 samples, each one 1/1200th of a degree wide. The left edge of the left most sample would be on a longitude line and the right edge of the right-most sample would be too. The next tile should have no data duplication from this tile.

If we are using point pixels, that one degree needs 1201 samples, each 1/1200th of a degree apart, but with no width on their own. The longitude line of the left edge has samples exactly on it, as does the right edge. And in fact, the right edge of our tile is a duplicate of the left-edge of the previous tile.
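The arithmetic for the two conventions can be sketched directly; the function names here are mine, not from any GeoTIFF library:

```cpp
#include <cassert>
#include <cmath>

// Area pixels: 1200 cells per degree; sample i represents the cell
// [i/1200, (i+1)/1200) east of the tile's west edge, and is usually
// referenced by its center, which is inset half a cell.
double area_pixel_center(double west, int i) {
    return west + (i + 0.5) / 1200.0;
}

// Point pixels: 1201 posts per degree; sample i sits exactly on the
// grid line at i/1200, so posts 0 and 1200 land on the tile edges
// (and post 1200 duplicates the next tile's post 0).
double point_pixel_location(double west, int i) {
    return west + i / 1200.0;
}
```

Note that post 1200 of a point-pixel tile starting at -71 lands exactly on -70: that shared column is the duplication described above.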

We could choose to exclude two of the four edges in the pixel-center case to avoid duplication, but when we're tiling with pixel-center data, it is sort of nice to have the entire tile.


The SRTM is pixel-center, at least in the raw data that JPL provides; we can only assume that every derived product should be that way too.

When you download a seamless SRTM tile from the USGS you get 1201 pixels, tagged as area pixels, and a bounding box that goes 1.5 arc-seconds outside the bounds you might expect. This is a little bit strange (IMO) but more or less correct.

The CGIAR SRTMs in GeoTIFF appear to be shifted by half a pixel: while the ARC ASCII files come in 6001 x 6001 blocks that are clearly point-pixel (which is what we would expect from an SRTM-derived product), the GeoTIFF version provides 6000 x 6000 postings but is listed as area-pixel with bounds on tile boundaries. Visual inspection indicates that the south and east rows have been dropped to avoid duplication; this would imply that the bounding box should be inset on the south and east edges by 1.5 arc-seconds (half a pixel) and outset on the other edges.

Follow-Up 4/1/11: the v4 CGIAR SRTM files appear to have been recut at 6001x6001 for correct pixel-center formatting. The GeoTIFFs are marked as area pixels and the bounding box is expanded by 0.5 samples on all sides.


NED appears to truly be area aligned (unlike SRTM); every form I have seen has been 3600 pixels centered on a lat-lon tile. In this case area pixels is an appropriate way to tag the data set.

Friday, July 03, 2009

CAS And Reference Counting Don't Mix

It took me a bit of head-scratching to understand this, but...

Atomic Reference Counting

You can use atomic operations to reference count objects. For example, if you look at the GNU C++ string implementation, you'll see that a string object is a pointer to a shared "real string". When a string is copied, the reference count is incremented. When the reference count drops to 0, the string that dropped the last reference deletes the buffer. ("Last one out, turn off the lights.")

The string doesn't need locks because it uses atomic operations to maintain the reference count. In particular, there is no race between thread A dropping the count to zero and thread B cloning the buffer again. If thread B is in a piece of code that can be cloning the buffer, thread B must be doing it via an object that has a reference to the buffer...thus thread A's decrement could not drop to 0 - at worst it could drop to 1 (B having the remaining count).
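A minimal sketch of this style of buffer sharing (my own toy types, not the actual libstdc++ code):

```cpp
#include <atomic>
#include <cassert>
#include <string>

// Shared immutable buffer with an atomic reference count. Whoever drops
// the count to zero deletes it - "last one out, turn off the lights."
struct SharedBuf {
    std::atomic<int> refs{1};
    std::string data;
    explicit SharedBuf(std::string d) : data(std::move(d)) {}
};

struct MiniString {
    SharedBuf* buf;
    explicit MiniString(std::string d) : buf(new SharedBuf(std::move(d))) {}
    // Copying is safe without locks: we can only copy via an object that
    // already holds a reference, so the count cannot hit zero under us.
    MiniString(const MiniString& o) : buf(o.buf) {
        buf->refs.fetch_add(1, std::memory_order_relaxed);
    }
    ~MiniString() {
        if (buf->refs.fetch_sub(1, std::memory_order_acq_rel) == 1)
            delete buf;
    }
    MiniString& operator=(const MiniString&) = delete;
};
```

The copy constructor is exactly the "thread B" case from the text: it only runs on behalf of an object that already owns a reference, which is why thread A's decrement can at worst reach 1, never 0, while B is copying.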

Compare and Swap (CAS) For Structures

CAS can be a useful way to modify a structure. A typical CAS operation will:
  • Make a local copy of the value in a "slot" to be modified.
  • Calculate a new value for the slot.
  • Attempt to CAS the new value in for the old.
  • If the slot contains a value other than the old value (the CAS fails) we know some other thread got in there instead. We loop around and retry.
For example, we can push a new object onto a list by first preparing our new head node and then CASing. If the head pointer changed, we know that our newly prepared node is not pointing to the old head and we need to retry.
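The push described above, sketched with std::atomic (a toy global stack, not any particular library's implementation):

```cpp
#include <atomic>
#include <cassert>

struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> head{nullptr};

// Prepare the new node pointing at the current head, then CAS it in.
// On failure, compare_exchange_weak reloads the current head into
// n->next for us, so the retry loop is just the fix-up described above.
void push(int v) {
    Node* n = new Node{v, head.load()};
    while (!head.compare_exchange_weak(n->next, n)) {
        // head moved under us; n->next now holds the new head - retry.
    }
}
```

Note that popping from such a stack is much harder than pushing (the ABA problem), which is part of why the reference-counting discussion below matters.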

CAS + Reference Count = Race Condition

One way to make a general purpose data structure lock-free and thread-safe is to have the structure accessible by pointer. The modify operation uses CAS - that is, the modifier clones the structure, makes the modifications, and attempts to swap the structure back in by pointer (using CAS). If the swap fails, the writer thread has to retry.

In order for this to work, we need to be sure that the old structure is not destroyed while anyone is reading it.

My immediate thought was: no problem - we can use reference counts. We will increase the reference count when using that old structure and not throw it out until everyone is done.

Such an algorithm would go something like this:
to read:
retain our struct
read it
release our struct

to modify:
retain our struct
copy it
modify the copy
attempt to CAS the structure back
if the CAS succeeds, release the old struct twice.
if the CAS fails, release the old struct once and the new struct once and retry.
(Where did I get those release counts? Well, the current struct is retained once as "the master copy". We retain it again to assure that it isn't deleted while we copy it. So if we successfully modify, we need to release it twice - once for us and once because it isn't the master copy any more. If the CAS fails, we release only our "don't delete while we copy" reference, and we release our new copy to deallocate it.)

Why This Fails

It turns out that you can't implement this structure using typical atomic operations (CAS and atomic increment/decrement). The problem is in the read. How exactly do we increase the retain count?

In GNU's string they do something like this:
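A representative sketch of the idiom (my own reconstruction, not the actual libstdc++ source):

```cpp
#include <atomic>
#include <cassert>

struct Buf {
    std::atomic<int> refcount{1};
};

// The copy path: dereference the buffer pointer held by this string
// object, then atomically bump the count it points to.
void retain(Buf* my_buffer) {
    my_buffer->refcount.fetch_add(1, std::memory_order_relaxed);
}
```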
And that works great, because my_buffer is immutable while we hold a reference. But in our operation, the pointer to the struct is itself the target of atomic operations - and yet we have to dereference it to get to the reference count.

The code, when broken down, would look something like this:
  1. dereference my_buffer to find the effective address of the reference count.
  2. atomically modify that reference count at its address.
The problem is that between step 1 and 2 (these are not atomic with regards to each other) someone might first CAS the pointer to the shared object (and it won't fail since we haven't modified that pointer) and then decrease the reference count of the shared object to zero, deleting it.

Now for step 2 the address we have computed is already bogus! Doh!

This can be worked around if an architecture supports very strong conditional operations based on memory modification. In other words, if we can execute the atomic increment conditional on an unmodified pointer to the buffer, we can correctly "bounce out and retry" if someone switches the underlying object before we get our reference count in.

Of course, most of the machines in my office don't have that kind of hardware.

Hazard Pointers - Why They Work

There is an alternate design pattern that works around this: hazard pointers. The basic idea (to heavily oversimplify things) is this: there is a finite amount of memory (where each slot is written to by exactly one thread to avoid having to lock or control this memory) where a thread can declare "hey, I am using this structure".

Threads deleting structures don't free a memory block if anyone is using it, which they determine by scanning the array of hazard pointers.

Why do hazard pointers work? They work because, unlike our reference counts, we don't have to first dereference a pointer to "protect" it (with the reference count). Since the hazard pointer is the pointer to the block, we can protect first, dereference second, something like this:
set our hazard pointer to the block we want to use
re-read the ptr to the block and compare it with its old value to detect that it hasn't changed already.
If it has, loop back and try again.
If it has not changed, we are now safe to use it, because our hazard pointer is established.
The clever aspect of hazard pointers is that, by providing per-thread storage, the total number of places to check for hazard pointers can be relatively small, and no dereferencing is needed to protect a block.
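The protect-then-validate loop above can be sketched like this (one hazard slot per thread, hypothetical Node type; a real implementation would have an array of slots and a deferred-reclamation scan):

```cpp
#include <atomic>
#include <cassert>

struct Node { int value; };

std::atomic<Node*> shared{nullptr};
thread_local std::atomic<Node*> my_hazard{nullptr};

// Publish the hazard pointer BEFORE anyone dereferences the block, then
// re-check that the shared pointer hasn't moved. If it moved, the block
// may already be freed, so loop back and try again.
Node* acquire() {
    Node* p;
    do {
        p = shared.load(std::memory_order_acquire);
        my_hazard.store(p, std::memory_order_seq_cst);
    } while (shared.load(std::memory_order_seq_cst) != p);
    return p;
}

void release() { my_hazard.store(nullptr); }
```

The key contrast with the reference-count scheme: the protection (the hazard store) happens on memory we own, using only the pointer's value - no dereference is needed until after validation succeeds.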

Thursday, July 02, 2009

Going Lock-Free - Shared Resources

X-Plane uses a reference counting scheme for resources like textures and models. Each time some scenery element needs a texture, it asks a central manager to grab the texture. This might result in a new object being created, or it might result in an increased reference count to an existing object.

During the version 9 run it became apparent that a central lock on the "big list of textures" was going to be increasingly problematic - the lock was already showing up in performance profiles under only moderate strain. Clearly a central lock was not going to be an option.

The scheme we use now is based on a central lock that is only needed for resource allocation/destruction - not access. It goes something like this:
  • Resources are always referenced by a pointer, not an index. This means that client code that simply needs to utilize the resource can do so without access to the "central list". Thus resource usage can be lock free.

    (This is the most important case. We only create and destroy a texture once but we might use thousands of textures per frame.)

  • The reference count on an object is atomic, so we don't need to lock the object to add or delete references. From a performance standpoint this is nice but not critical.

  • Resource deletion is explicit. This is the most surprising part: we don't actually release a resource until we explicitly make a "garbage collect" call. This has two effects:

    1. We avoid thrash because we can start re-using a zero-reference object before it is thrown out.
    2. It limits the locking of the big list of objects to one explicit function call.
The result of this is a system that doesn't lock during the main frame loop under any conditions, and doesn't thrash too badly during load.
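A schematic of the scheme (hypothetical names, not X-Plane's actual code): once a client holds a pointer, use is lock-free; the central lock is only taken at acquisition time and in the explicit garbage-collect pass.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct Resource {
    std::atomic<int> refs{0};
    // ... texture data would live here ...
};

class ResourceManager {
    std::mutex big_list_lock; // only taken in acquire() and garbage_collect()
    std::map<std::string, std::unique_ptr<Resource>> resources;
public:
    // Returns a retained pointer; clients then use it with no locking.
    Resource* acquire(const std::string& name) {
        std::lock_guard<std::mutex> lock(big_list_lock);
        auto& slot = resources[name];
        if (!slot) slot = std::make_unique<Resource>();
        slot->refs.fetch_add(1, std::memory_order_relaxed);
        return slot.get();
    }
    // Atomic, lock-free: dropping to zero does NOT delete, so a
    // zero-reference resource can be cheaply re-acquired (no thrash).
    static void release(Resource* r) {
        r->refs.fetch_sub(1, std::memory_order_release);
    }
    // The explicit sweep: the only moment unreferenced resources die.
    void garbage_collect() {
        std::lock_guard<std::mutex> lock(big_list_lock);
        for (auto it = resources.begin(); it != resources.end();)
            it = (it->second->refs.load() == 0) ? resources.erase(it) : ++it;
    }
    std::size_t count() {
        std::lock_guard<std::mutex> lock(big_list_lock);
        return resources.size();
    }
};
```

The deliberate oddity is visible in release(): it is a plain decrement with no cleanup, which is exactly what keeps the hot path lock-free and lets a briefly-dropped texture be revived before the next collect.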

Threading and Sanity

Threaded code is expensive - it takes longer to write and is harder to debug. Here are some of the things I try to do to keep myself sane while working on threaded code.
  • Only thread when there is a need for threading. Threading is a way to capture CPU performance. I consider threading as a way to meet performance goals, not as a default.

    This also means not trying to thread code that is either not performance critical or isn't burning up enough CPU to be worth it. The less code is threaded, the cheaper the app is to develop, so better to only thread the "big fish".

  • The main thread handles flow control. A number of different graphics and UI APIs require access only from the main thread, so this is a natural extension of that design (particularly in a game with a sequential render loop).

    The requirement to run on the main thread simplifies some synchronization issues because certain API calls into an object will be "naturally exclusive" due to their main-thread requirement. A debug macro catches illegal calls to these functions from the wrong thread.

  • Message queues for resource ownership. With this idiom, an object can have only one owning thread - the ownership is transferred via thread-safe message queues. Because the message queue is a synchronizer, no locks are required and no dead-locks can occur.
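The ownership-transfer idiom can be sketched with a minimal thread-safe queue (my own toy types). Moving a unique_ptr through the queue makes the hand-off explicit: after send(), the producer no longer owns the object, so the sole-owner thread can mutate it with no locks on the object itself.

```cpp
#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>

// Minimal thread-safe message queue. The queue's internal lock is the
// only synchronization; the payload itself never needs one because
// exactly one thread owns it at any moment.
template <typename T>
class MessageQueue {
    std::mutex m;
    std::condition_variable cv;
    std::queue<std::unique_ptr<T>> q;
public:
    void send(std::unique_ptr<T> msg) {
        { std::lock_guard<std::mutex> lock(m); q.push(std::move(msg)); }
        cv.notify_one();
    }
    std::unique_ptr<T> receive() { // blocks until a message arrives
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return !q.empty(); });
        auto msg = std::move(q.front());
        q.pop();
        return msg;
    }
};
```

A typical round trip: the main thread sends work to a worker over one queue and gets the finished object back over another; neither thread ever touches the object while the other owns it, so no deadlock is possible.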