Lightweight Correlation-Aware Table Compression
Mihail Stoian, Alexander van Renen, Jan Kobiolka, Ping-Lin Kuo, Josif Grabocka, Andreas Kipf

TL;DR
This paper introduces Virtual, a framework that enhances open data formats with automatic correlation-aware compression, significantly reducing storage size with minimal impact on scan performance.
Contribution
Virtual seamlessly integrates with existing formats to automatically exploit data correlations, achieving substantial compression improvements without manual correlation specification.
Findings
Reduces file sizes by up to 40% on data-gov datasets.
Maintains high scan performance with minimal overhead.
Outperforms traditional compression methods in open data formats.
Abstract
The growing adoption of data lakes for managing relational data necessitates efficient, open storage formats that provide high scan performance and competitive compression ratios. While existing formats achieve fast scans through lightweight encoding techniques, they have reached a plateau in terms of minimizing storage footprint. Recently, correlation-aware compression schemes have been shown to reduce file sizes further. Yet current approaches either incur significant scan overheads or require manual specification of correlations, limiting their practicality. We present Virtual, a framework that integrates seamlessly with existing open formats to automatically leverage data correlations, achieving substantial compression gains with minimal scan performance overhead. Experiments on data-gov datasets show that Virtual reduces file sizes by up to 40%…
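To illustrate the general idea behind correlation-aware compression, the sketch below stores one column as its residual relative to a linear function of another column. This is a minimal illustration and not Virtual's actual algorithm: the function names, the column values, and the fixed slope and intercept are all hypothetical, and the abstract does not say which correlation functions Virtual uses or how it discovers them.

```python
# Illustrative sketch only; this is NOT the Virtual implementation.
# All names, data, and the linear model are assumptions made for the example.

def bits_needed(values):
    """Bits per value under a simple frame-of-reference encoding
    (each value stored as an offset from the column minimum)."""
    span = max(values) - min(values)
    return max(1, span.bit_length())

def encode_correlated(source, target, slope, intercept):
    """Replace `target` by its residual relative to slope * source + intercept."""
    return [t - (slope * s + intercept) for s, t in zip(source, target)]

def decode_correlated(source, residuals, slope, intercept):
    """Rebuild the target column from the source column and the residuals."""
    return [slope * s + intercept + r for s, r in zip(source, residuals)]

# Hypothetical columns: `total` is roughly 2 * base + 100, with small deviations.
base = [1000, 2500, 4000, 7000, 9000]
total = [2101, 5099, 8100, 14102, 18100]

residuals = encode_correlated(base, total, slope=2, intercept=100)

# The encoding is lossless: the original column is recovered exactly.
assert decode_correlated(base, residuals, 2, 100) == total

print(bits_needed(total), bits_needed(residuals))
```

In this toy example the residual column spans a far smaller value range than the original, so a lightweight encoding needs fewer bits per value. The cost is that reading `total` now also requires reading `base`, which is the kind of scan overhead the abstract refers to.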
Taxonomy
Topics: Algorithms and Data Compression · Distributed and Parallel Computing Systems · Advanced Data Storage Technologies
