Compressing Tabular Data via Latent Variable Estimation
Andrea Montanari, Eric Weiner

TL;DR
This paper presents a lossless compression method for tabular data that estimates latent variables for rows and columns, partitions the table into blocks, and applies a sequential coder to each block. Under the paper's probabilistic model, the method attains the optimal compression rate, which classical methods do not.
Contribution
Introduces a latent-variable-based compression algorithm for tabular data, supported by theoretical analysis and empirical validation, that outperforms classical schemes.
Findings
The proposed method achieves the optimal entropy rate.
Classical schemes like Lempel-Ziv do not reach the optimal rate.
The model satisfies an asymptotic equipartition property.
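For reference, an asymptotic equipartition property of this kind is usually stated in the form below. The notation is an assumption and is not taken from the paper: $X^{m,n}$ denotes an $m \times n$ table drawn from the model, $P$ its distribution, and $H$ the entropy rate per entry. The paper's exact statement, its mode of convergence, and its conditions on how $m$ and $n$ grow may differ.

```latex
-\frac{1}{mn}\,\log P\bigl(X^{m,n}\bigr) \;\xrightarrow{\;p\;}\; H
\qquad \text{as } m, n \to \infty .
```

Informally, typical tables all have probability close to $2^{-mnH}$ (with logarithms in base 2), so about $mnH$ bits are needed to describe one of them.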
Abstract
Data used for analytics and machine learning often take the form of tables with categorical entries. We introduce a family of lossless compression algorithms for such data that proceed in four steps: (i) estimate latent variables associated to rows and columns; (ii) partition the table in blocks according to the row/column latents; (iii) apply a sequential (e.g. Lempel-Ziv) coder to each of the blocks; (iv) append a compressed encoding of the latents. We evaluate them on several benchmark datasets, and study optimal compression in a probabilistic model for tabular data, whereby latent values are independent and table entries are conditionally independent given the latent values. We prove that the model has a well defined entropy rate and satisfies an asymptotic equipartition property. We also prove that classical compression schemes such as Lempel-Ziv and finite-state…
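The four steps in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the latent estimator (`mode_label`, the most frequent symbol in a row or column) is a hypothetical stand-in for whatever estimator the paper uses, zlib's DEFLATE (an LZ77-based coder) stands in for the sequential coder, entries are assumed to be integers in 0..255, and the output is a Python dict rather than a single byte stream.

```python
import zlib
from collections import Counter


def mode_label(values):
    # Hypothetical latent estimator (an assumption, not the paper's):
    # label a row or column by its most frequent symbol.
    return Counter(values).most_common(1)[0][0]


def encode(table):
    m, n = len(table), len(table[0])
    # Step 1: estimate one latent per row and one per column.
    row_lat = [mode_label(row) for row in table]
    col_lat = [mode_label([table[i][j] for i in range(m)]) for j in range(n)]
    # Steps 2 and 3: partition into blocks by (row latent, column latent)
    # and run a sequential coder on each block separately.
    blocks = {}
    for a in sorted(set(row_lat)):
        rows = [i for i in range(m) if row_lat[i] == a]
        for b in sorted(set(col_lat)):
            cols = [j for j in range(n) if col_lat[j] == b]
            raw = bytes(table[i][j] for i in rows for j in cols)
            blocks[(a, b)] = zlib.compress(raw, 9)
    # Step 4: append a compressed encoding of the latents.
    latents = zlib.compress(bytes(row_lat) + bytes(col_lat), 9)
    return {"shape": (m, n), "blocks": blocks, "latents": latents}


def decode(code):
    m, n = code["shape"]
    lat = zlib.decompress(code["latents"])
    row_lat, col_lat = list(lat[:m]), list(lat[m:])
    table = [[0] * n for _ in range(m)]
    for (a, b), blob in code["blocks"].items():
        rows = [i for i in range(m) if row_lat[i] == a]
        cols = [j for j in range(n) if col_lat[j] == b]
        raw = zlib.decompress(blob)
        k = 0
        # Refill the block in the same row-major order used by encode().
        for i in rows:
            for j in cols:
                table[i][j] = raw[k]
                k += 1
    return table


# Usage: a small categorical table with entries in {0, 1, 2, 3}.
sample = [[(i % 2) * 2 + (j % 2) for j in range(12)] for i in range(10)]
assert decode(encode(sample)) == sample
```

Because the latents are stored alongside the blocks, the decoder never needs to re-estimate them, which is what keeps the scheme lossless regardless of how crude the estimator is. A better estimator only changes how compressible each block is.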
Taxonomy
Topics: Algorithms and Data Compression · Advanced Data Compression Techniques · Computability, Logic, AI Algorithms
