Things get a little trickier if we need random access to the data. The structure I am playing with now provides something close to that. It is a vector where a run of repeated values is stored only once. It has two parts:
- A run index (stored as a sorted vector of key-value pairs) maps the beginning of a run (in the virtual data) to the start of storage (in a real data vector).
- A real data vector contains each value as few times as possible.
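A minimal sketch of that two-part layout might look like the following. All the names (`RleVector`, `runs`, `data`) are hypothetical, as is the convention that a run covers the virtual range up to the next entry's start:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of the two-part layout described above.
// Each run-index entry maps a virtual start position to a storage start
// position; a run covers the virtual range up to the next entry's start.
struct RleVector {
    std::vector<std::pair<std::size_t, std::size_t>> runs; // sorted (virtual_start, storage_start)
    std::vector<int> data;   // real data: each repeated value stored once
    std::size_t size = 0;    // virtual length
};

// Example: the virtual sequence 7 7 7 7 1 2 3 (N = 7) stores only
// four values, {7, 1, 2, 3}, plus two index entries: one compressed
// run of 7s and one literal run.
RleVector example() {
    RleVector v;
    v.runs = {{0, 0},   // run of 7s: virtual position 0, storage position 0
              {4, 1}};  // literal run: virtual position 4, storage position 1
    v.data = {7, 1, 2, 3};
    v.size = 7;
    return v;
}
```

Note that nothing in the index says which runs are compressed: a run whose storage span is shorter than its virtual span must be a compressed one, so the distinction can be recovered from the two start positions alone.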
Complexity properties: call N the number of items, U the number of unique items, and R the number of runs (that is, the number of times we switch between RLE and non-RLE data).
- In the worst case (totally non-repeating data), the storage is the same as a plain vector: O(N). In the best case (all values identical) we are O(1). In both cases R=1, since everything sits in a single run.
- Generally our storage is O(R+U), which should be less than O(N).
- Random access is O(log(R)) via binary search on the run index - not quite constant time, but pretty fast. For the worst case of random data, R is 1, so lookups are effectively constant there.
- Pushing data onto the end is constant time, plus any cost of memory reallocation; since the vector is compressed, reallocations should have less data to move.
- Deleting data from the front is O(R+U) instead of O(N), which is a bit of an improvement.
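The O(log(R)) lookup above can be sketched with a binary search over the run index. This assumes the hypothetical (virtual_start, storage_start) pair layout, and uses the storage-span-versus-virtual-span trick to tell compressed runs from literal ones:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of O(log R) random access over a (virtual_start, storage_start)
// run index. A run whose storage span is shorter than its virtual span is
// compressed (one stored value); otherwise it is literal and we index into
// it by offset. Names here are illustrative, not from a real library.
using Runs = std::vector<std::pair<std::size_t, std::size_t>>;

int at(const Runs& runs, const std::vector<int>& data,
       std::size_t size, std::size_t i) {
    assert(i < size);
    // Binary search: first run whose virtual_start exceeds i...
    auto it = std::upper_bound(
        runs.begin(), runs.end(), i,
        [](std::size_t x, const std::pair<std::size_t, std::size_t>& r) {
            return x < r.first;
        });
    --it;  // ...then step back to the run containing i.
    std::size_t vstart = it->first, sstart = it->second;
    auto next = it + 1;
    std::size_t vend = (next == runs.end()) ? size : next->first;
    std::size_t send = (next == runs.end()) ? data.size() : next->second;
    bool literal = (send - sstart) == (vend - vstart);
    return literal ? data[sstart + (i - vstart)] : data[sstart];
}
```

For the virtual sequence 7 7 7 7 1 2 3 stored as runs `{{0,0},{4,1}}` over data `{7,1,2,3}`, `at(runs, data, 7, 2)` lands in the compressed run and returns 7, while `at(runs, data, 7, 5)` lands in the literal run and returns 2.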