The Hacks of Life: Degenerate Shape Files

The shapefile specification is more like a pipe dream than a file format; it is a document that describes what people wish would happen in shapefiles. Examples of things one might wish for:

No self-intersecting, non-simple, or generally hosed polygons.
Polygon windings are CW for outer boundaries and CCW for holes.
Whatever convention a shapefile decides to follow, the same convention is used for the entire file.

Unfortunately shapefiles don't always work that way. I discovered that some of the SWBD tiles violate the winding rule.

Fortunately times are changing: with so much open source GIS going on, often you can find a log of other people's misery and pretty quickly get a handle on the problem.

Of particular interest is the fight between correctness and performance in the trac log. How do we rapidly import our data if we don't know how good it is? Checking data quality is expensive?

Ways To Interpret a Broken Shapefile

There are a few ways to interpret a broken shapefile:

We can go strictly by winding order, even when it is wrong, and perform a union on overlapping areas.
We can use a "binary" approach to polygons (think of what you get when you rasterize a non-simple polygon) where we switch inside and outside every time we cross an edge.
We can simply throw out data that violates a spec requirement.
We can assume (for better or for worse) that the first winding is outer, the rest are inner.
We can do a mix of this based on heuristics (e.g. what OGR does: if the windings look sane, trust them, otherwise slowly calculate ring nesting).

I have chosen the "binary" approach for the following reasons:

It is never dependent on the order of windings in the file or the direction of windings.
It works correctly with polygons with holes, polygons with islands in holes, and multipolygons (multiple disjoint outer rings), as long as the polygons are non-overlapping.
When polygons do overlap, the results are reasonably acceptable.
The implementation is really fast.

To understand that last point, we have to look at how CGAL processes polygons.

A Sweeping Improvement in Performance

CGAL implements sweep-line algorithms. It's beyond this post to explain how it works, but the main thing to note is that sweep-line processes a big pile of overlapping line segments in O(nlogn+klogn) time where n is the number of curves and k is the number of intersections. When adding/merging or inserting a large number of line segments into a planar map, this usually beats the pants off incremental modification algorithsm.

(A particularly painful case is building a very large ring in an empty map...each insert has to check for intersections with every other segment in the map with basically no spatial indexing or speed-up. Ouch!)

Of course we can go a lot faster when we know our data doesn't overlap, but go look at the title of the blog post...we're dealing with shape files here! We'll count ourselves lucky if polygons have 3 or more points.

Sweep-line is found in a number of interesting places:

Bulk-adding line segments to an empty (or very full) map.
Overlaying one map with another.
Boolean operations between two sets of polygons.
Checking whether a polygon is simple. (This is done with a stripped down sweep algorithm, but time complexity should be about the same.)

This list is interesting because approaches that use these different APIs are going to run at about the same time complexity. The number of times we use these operations might matter a lot.

Adding Up Lots Of Polygons

CGAL's general polygon set provides bulk insertion routines - given a range of polygons (or polygons with holes) you can join/difference them all at once. Looking at the code, it appears to use a divide and conquer scheme. I think this should be more optimal than a series of individual merges (which has to be slow, because by the end we are re-merging the finished data over and over), but frankly I can't tell where the various edges get consolidated (if ever).

Here's the problem: if we want to insert polygons with holes, we need to insert non-degenerate ones. (If we insert polygons, then subtract the holes, the results are not the same!) But...non-degenerate polygons, where do we get those?

Rather than an incremental approach (clean each winding, organize the windings, build sane polygons with holes, then merge them) I observe that we don't need clean input to use the "binary" rule for polygon interiors.

So my new shapefile processor simply inserts every single line segment from the entire file into an empty arrangement at once. This is about as fast as we can get - a single sweep that simultaneously deals with overlapping polygons, degenerate windings, and incorrect winding relationships.

I then do a breadth-first-search from the outside of the map to the inside, assigning properties. My arrangement tracks the original providing polygon so each time I "cross" a boundary I can track the change in ownership via a set.

The memory cost is O(F + E) where F is the number of faces in the final imported arrangement and E is the number of polygon edges. The time complexity is O(ElogE+KlogE + F) where K is the number of intersections our shape file turned out to have. (In practice, that F doesn't matter, we're almost guaranateed to have more edge than situations in a real world situation.

The Hacks of Life

Thursday, July 23, 2009

Degenerate Shape Files

No comments:

Post a Comment