## Monday, December 22, 2008

### Compressed Vectors - Part 1

Normally for data that has a lot of repeating values, I use run-length encoding - it works particularly well when the data is accessed from start to end in forward order.  For example, the spot lighting masks for X-Plane's 2-d panels are RGBA image that are RLE compressed. Continuous scan lines of transparent window become a single RLE line segment.  This actually makes processing faster, because the decision to skip transparent areas is made once per scan line, not once per pixel.

Things get a little bit trickier if we need random access to the data.  The structure I am playing with now provides something similar to that.  It is a vector where continuous values are stored once in runs.  It has two parts:
• A run index (stored as a sorted vector of key-value pairs) maps the beginning of a run (in the virtual data) to the start of storage (in a real data vector).
• A real data vector contains each value as few times as possible.
We don't need to store "how long is this run" in the data vector - the index gives it to us.  We just look at the next index key-value pair and see (1) how long is our run and (2) how much data is behind it.  If only one value backs our 5-value run, we are run-length compressed.

Complexity properties: call N the number of items, U the number of unique items, and R the number of runs (that is, the number of times we change from RLE to non-RLE data.
• In the worst case (totally non-sequential data), the storage is the same as vector: O(N).  In the best case (all the same) we are O(1).  In both cases, R=1.
• Generally our storage is O(R+U), which should be less than O(N).
• Random access is O(log(R)) - not quite constant time, but pretty fast.  For the worst case random data, R is 1, so we can be fast that way.
• Pushing data onto the end is constant time plus any cost of memory reallocation; since the vector is compressed, hopefully that is less.
• Deleting data from the front is O(R+U) instead of O(N), which is a bit of an improvement.
At this point the unit tests work, but I don't have stats on real-world compression.