Thursday, June 03, 2010

Interval Sets With the STL

I spent some time today working on an interval set. The basic idea of an interval set is to record a set of disjoint ranges that partition a number space into a finite "included" area and an infinite "excluded" area. Or to put it more simply, [3, 6) is an interval, and [3, 6) [8, 10) is an interval set.

Googling around for this I found a few ideas based on using a sorted map, with the interval beginning as the key and the interval end as the value. My approach is different, and is closer to the original implementation of Macintosh regions: a vector of beginning and ending pairs, e.g. { 3, 6, 8, 10 }.

I'm not sure whether this approach is superior to a map-based approach; I think I'd have to code each one all the way to completion. The vector does have a few advantages:
  • Compact storage, with minimal overhead.
  • Reading the sorted array can usually be done in linear or log-N time.
The basic rules for the interval set are:
  • Intervals are inclusive at the bottom and exclusive at the top. So the interval 3,6 includes the number 3 but excludes the number 6. (Thus given two intervals [3,6) and [6,9) the number 6 is included in exactly one of them.)
  • Intervals must have non-zero length (so [3,3) is illegal).
  • Intervals are finite - there is no notation to say "everything below 3" is included.
The heart of the algorithm is a "merge" operation. In a merge, the sequence of interval edges from two interval sets are traversed together (think of a merge sort) and each new smallest sub-interval is evaluated for inclusion by a boolean operation. This lets us perform a union, difference, intersection, or symmetric difference in linear time O(N+M) where N and M are the lengths of the vectors.

(The actual time is actually slightly worse because vector will need to periodically reallocate its memory during the creation of the new resulting vector. We could use a heuristic to pre-allocate some space at a loss of memory efficiency. If we used a set we could avoid memory costs, but we'd end up with O(NlogN) time to build the set anyway, and we'd pay node overhead, which is almost certainly worse than any extra on a vector of 32-bit floats or integers.)

When we search for an interval (using lower_bounds) we can tell whether we are "in" or "out" of the region by looking at whether the index of the returned region is even or odd - even regions are in the set and odd ones are outside of it.

The interval class is also heavily special cased for a number of optimizations:
  • Separate operators on pair allow for the processing of a single interval (rather than a set). When we know the single interval, we can take a number of short-cuts, and we can perform "in-place editing" using log-N searches into the original interval set.
  • Operations on sets can identify short cuts. For example, the intersection of two sets whose range is disjoint is always empty. (In other words, if the last value in A is less than the first value in B, intersecting A and B is an empty set.)
I haven't used the interval set class enough to profile it; real measurement will tell which of these optimizations is a win. One tricky aspect of the code is that vector is a leaky abstraction - it makes mid-vector insertion look cheap when really it is a linear operation (because all subsequent elements must be copied to their new locations).

As an example of why this might matter: consider symmetric difference (XOR) of an interval set and a single range. This operation can be computed simply by: deleting the range bounds from the set if they exist, otherwise inserting them. In other words, given the interval set [0,3) [6,9) [12,15) we can XOR this with the interval [6,8) by deleting 6 and inserting 8 - the new XOR is [0,3) [8,9) [12,15). This is a relatively fast operation: two log-N searches (for 6 and 8) and one delete and one insert.

Despite the simplicity of the algorithm, vector is going to require two mid-vector editing operations, so our average time complexity is O(N) - linear! (On average half the elements of the vector are after us, and we do two editing ops.)

For this reason, the special case of a disjoint XOR is special cased. If we XOR [-10, -8) into the above region, we can observe that -8 < 0, therefore the regions don't intersect, and -10, -8 simply needs to be pre-pended. This can be done with a single insert, and thus should run about twice as fast as a pair of individual inserts.