Googling around for this I found a few ideas based on using a sorted map, with the interval beginning as the key and the interval end as the value. My approach is different, and is closer to the original implementation of Macintosh regions: a vector of beginning and ending pairs, e.g. { 3, 6, 8, 10 }.
I'm not sure whether this approach is superior to a map-based approach; I think I'd have to code each one all the way to completion. The vector does have a few advantages:
- Compact storage, with minimal overhead.
- Reading the sorted array can usually be done in linear or log-N time.
- Intervals are inclusive at the bottom and exclusive at the top. So the interval 3,6 includes the number 3 but excludes the number 6. (Thus given two intervals [3,6) and [6,9) the number 6 is included in exactly one of them.)
- Intervals must have non-zero length (so [3,3) is illegal).
- Intervals are finite - there is no notation to say "everything below 3" is included.
(The actual time is actually slightly worse because vector will need to periodically reallocate its memory during the creation of the new resulting vector. We could use a heuristic to pre-allocate some space at a loss of memory efficiency. If we used a set we could avoid memory costs, but we'd end up with O(NlogN) time to build the set anyway, and we'd pay node overhead, which is almost certainly worse than any extra on a vector of 32-bit floats or integers.)
When we search for an interval (using lower_bounds) we can tell whether we are "in" or "out" of the region by looking at whether the index of the returned region is even or odd - even regions are in the set and odd ones are outside of it.
The interval class is also heavily special cased for a number of optimizations:
- Separate operators on pair
allow for the processing of a single interval (rather than a set). When we know the single interval, we can take a number of short-cuts, and we can perform "in-place editing" using log-N searches into the original interval set. - Operations on sets can identify short cuts. For example, the intersection of two sets whose range is disjoint is always empty. (In other words, if the last value in A is less than the first value in B, intersecting A and B is an empty set.)
As an example of why this might matter: consider symmetric difference (XOR) of an interval set and a single range. This operation can be computed simply by: deleting the range bounds from the set if they exist, otherwise inserting them. In other words, given the interval set [0,3) [6,9) [12,15) we can XOR this with the interval [6,8) by deleting 6 and inserting 8 - the new XOR is [0,3) [8,9) [12,15). This is a relatively fast operation: two log-N searches (for 6 and 8) and one delete and one insert.
Despite the simplicity of the algorithm, vector is going to require two mid-vector editing operations, so our average time complexity is O(N) - linear! (On average half the elements of the vector are after us, and we do two editing ops.)
For this reason, the special case of a disjoint XOR is special cased. If we XOR [-10, -8) into the above region, we can observe that -8 < 0, therefore the regions don't intersect, and -10, -8 simply needs to be pre-pended. This can be done with a single insert, and thus should run about twice as fast as a pair of individual inserts.