TL;DR
This paper shows that permuting the columns of a column-oriented index before sorting can improve run-length compression: on realistic data sets, the right column order reduces the number of runs by a factor of two or more. Finding the best column order is NP-hard, but for many cases ordering the columns by increasing cardinality provably minimizes the number of runs.
Contribution
It proves that determining the best column order for run minimization is NP-hard, and that, for many cases, sorting columns by increasing cardinality minimizes the number of runs in the table columns.
Findings
For many cases, sorting columns by increasing cardinality provably minimizes the number of runs.
In experiments, sorting based on Hilbert space-filling curves is poor at minimizing the number of runs.
On realistic data sets, permuting the columns into the right order before sorting can reduce the number of runs by a factor of two or more.
Abstract
Column-oriented indexes, such as projection or bitmap indexes, are compressed by run-length encoding to reduce storage and increase speed. Sorting the tables improves compression. On realistic data sets, permuting the columns in the right order before sorting can reduce the number of runs by a factor of two or more. Unfortunately, determining the best column order is NP-hard. For many cases, we prove that the number of runs in table columns is minimized if we sort columns by increasing cardinality. Experimentally, sorting based on Hilbert space-filling curves is poor at minimizing the number of runs.
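To make the effect of column order concrete, here is a minimal Python sketch, not the authors' code. It counts the runs in each column after a lexicographic sort and compares two column orders on a small toy table. The function names (count_runs, runs_after_sort, increasing_cardinality_order) and the toy table are assumptions introduced for illustration only.

```python
def count_runs(column):
    """Count maximal runs of identical consecutive values in one column."""
    if not column:
        return 0
    return 1 + sum(1 for a, b in zip(column, column[1:]) if a != b)


def runs_after_sort(rows, order):
    """Permute the columns by `order`, sort the rows lexicographically,
    and return the total number of runs summed over all columns."""
    permuted = sorted(tuple(row[i] for i in order) for row in rows)
    return sum(count_runs(col) for col in zip(*permuted))


def increasing_cardinality_order(rows):
    """Column indexes ordered by increasing number of distinct values."""
    n_cols = len(rows[0])
    return sorted(range(n_cols), key=lambda i: len({row[i] for row in rows}))


# Hypothetical toy table: column 0 has 4 distinct values, column 1 has 2,
# and every combination of values occurs exactly once.
rows = [(b, a) for a in range(2) for b in range(4)]

heuristic = increasing_cardinality_order(rows)   # [1, 0]
print(runs_after_sort(rows, heuristic))          # 10 runs: low cardinality first
print(runs_after_sort(rows, [0, 1]))             # 12 runs: high cardinality first
```

On this toy table, putting the low-cardinality column first gives 2 + 8 = 10 runs, while the reverse order gives 4 + 8 = 12. The sketch only illustrates the heuristic on one small input. It says nothing about the cases where the heuristic is not optimal, and an exhaustive search over all column permutations grows factorially with the number of columns.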
