Squish: Near-Optimal Compression for Archival of Relational Datasets

Yihan Gao; Aditya Parameswaran

arXiv:1602.04256·cs.DB·June 21, 2016

TL;DR

Squish is a system that leverages relational structure and probabilistic models to achieve near-optimal compression of relational datasets, significantly reducing storage costs.

Contribution

We introduce Squish, a compression system that uses Bayesian Networks and Arithmetic Coding to capture dependencies among attributes. Squish also supports user-defined data types, and we prove that its compression algorithm is asymptotically optimal.

Findings

01. Achieves over 50% reduction in storage size compared to prior systems.

02. Captures complex dependencies among attributes in relational data.

03. The compression algorithm is proven to be asymptotically optimal.

Abstract

Relational datasets are being generated at an alarmingly rapid rate across organizations and industries. Compressing these datasets could significantly reduce storage and archival costs. Traditional compression algorithms, e.g., gzip, are suboptimal for compressing relational datasets since they ignore the table structure and relationships between attributes. We study compression algorithms that leverage the relational structure to compress datasets to a much greater extent. We develop Squish, a system that uses a combination of Bayesian Networks and Arithmetic Coding to capture multiple kinds of dependencies among attributes and achieve near-entropy compression rate. Squish also supports user-defined attributes: users can instantiate new data types by simply implementing five functions for a new class interface. We prove the asymptotic optimality of our compression algorithm and…
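
To illustrate why modeling dependencies among attributes can lower the achievable code length, here is a minimal sketch; it is not Squish's implementation. It assumes a hypothetical two-column toy table (`city`, `country`) and compares the ideal per-row bit cost when each column is coded from its own marginal distribution against coding `country` conditioned on `city`, as a two-node Bayesian network would. An arithmetic coder can approach these ideal code lengths with small overhead. All function names and data below are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy_bits(counts):
    """Shannon entropy (bits per symbol) of an empirical distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def independent_cost(rows):
    """Ideal bits per row when each column uses its own marginal model."""
    cities = Counter(city for city, _ in rows)
    countries = Counter(country for _, country in rows)
    return entropy_bits(cities) + entropy_bits(countries)

def conditional_cost(rows):
    """Ideal bits per row for H(city) + H(country | city)."""
    cities = Counter(city for city, _ in rows)
    total = len(rows)
    cost = entropy_bits(cities)
    for city, n in cities.items():
        # Distribution of country restricted to rows with this city.
        given = Counter(country for c, country in rows if c == city)
        cost += n / total * entropy_bits(given)
    return cost

# Hypothetical toy table in which country is fully determined by city.
rows = [("Paris", "FR"), ("Lyon", "FR"), ("Berlin", "DE"), ("Munich", "DE")] * 25

print(independent_cost(rows))  # 3.0 bits per row
print(conditional_cost(rows))  # 2.0 bits per row
```

In this toy table the conditional model saves one bit per row because `country` carries no information once `city` is known. A structure-agnostic model cannot exploit that relationship directly.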
