Compressing Tabular Data via Latent Variable Estimation

Andrea Montanari; Eric Weiner

arXiv:2302.09780 · cs.IT · February 21, 2023

Open Access

TL;DR

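This paper presents a lossless compression method for tabular data with categorical entries. The method estimates latent variables for rows and columns, partitions the table into blocks according to those latents, and applies a sequential coder to each block. Under the paper's probabilistic model it achieves the optimal compression rate, which classical schemes such as Lempel-Ziv do not reach.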

Contribution

Introduces a latent-variable-based lossless compression algorithm for tabular data, supported by theoretical analysis of a probabilistic model and by empirical evaluation on benchmark datasets, where it is compared against classical schemes.

Findings

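01  The proposed latent-estimation method achieves the optimal rate, that is, the entropy rate of the probabilistic model.

02  Classical schemes such as Lempel-Ziv do not reach this optimal rate.

03  The model has a well-defined entropy rate and satisfies an asymptotic equipartition property.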
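For orientation, an entropy rate and an asymptotic equipartition property are usually stated in the form below. The notation is assumed here for illustration and is not taken from the paper: $X^{m,n}$ denotes an $m \times n$ table drawn from the model, $P$ its distribution, $H$ the Shannon entropy, and $h$ the entropy rate per table entry. The precise scaling of $m$ and $n$ and the mode of convergence are those specified in the paper.

$$
h \;=\; \lim_{m,n\to\infty} \frac{1}{mn}\, H\!\left(X^{m,n}\right),
\qquad
-\frac{1}{mn}\,\log P\!\left(X^{m,n}\right) \;\longrightarrow\; h .
$$

In words, the per-entry log-likelihood of a typical table concentrates around $h$, so no lossless code can use fewer than about $h$ units per entry on average, and a code is rate-optimal when it attains $h$ asymptotically.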

Abstract

Data used for analytics and machine learning often take the form of tables with categorical entries. We introduce a family of lossless compression algorithms for such data that proceed in four steps: $(i)$ Estimate latent variables associated with rows and columns; $(ii)$ Partition the table into blocks according to the row/column latents; $(iii)$ Apply a sequential (e.g. Lempel-Ziv) coder to each of the blocks; $(iv)$ Append a compressed encoding of the latents. We evaluate it on several benchmark datasets, and study optimal compression in a probabilistic model for such tabular data, whereby latent values are independent and table entries are conditionally independent given the latent values. We prove that the model has a well-defined entropy rate and satisfies an asymptotic equipartition property. We also prove that classical compression schemes such as Lempel-Ziv and finite-state…
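The four steps listed in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `estimate_latents` is a hypothetical placeholder that groups rows (or columns) by their most frequent symbol, whereas the paper uses its own estimators; `zlib` (DEFLATE, an LZ77-based coder) stands in for the sequential coder; and table entries are assumed to be integers in 0..255 with at most 256 latent classes.

```python
import zlib
from collections import Counter


def estimate_latents(vectors, k):
    """Step (i), placeholder only: label each vector by its most frequent symbol.

    Hypothetical stand-in for a real latent estimator. Labels lie in 0..k-1.
    """
    modes = [Counter(v).most_common(1)[0][0] for v in vectors]
    ranked = [symbol for symbol, _ in Counter(modes).most_common()]
    return [min(ranked.index(m), k - 1) for m in modes]


def compress(table, k_rows=2, k_cols=2):
    """Encode a table (list of equal-length lists of ints in 0..255)."""
    m, n = len(table), len(table[0])
    rows = estimate_latents(table, k_rows)
    cols = estimate_latents(list(zip(*table)), k_cols)
    blocks = {}
    for r in range(k_rows):
        for c in range(k_cols):
            # Step (ii): collect the entries of block (r, c) in row-major order.
            raw = bytes(
                table[i][j]
                for i in range(m) if rows[i] == r
                for j in range(n) if cols[j] == c
            )
            # Step (iii): code each block separately with a sequential coder.
            blocks[(r, c)] = zlib.compress(raw)
    # Step (iv): append a compressed encoding of the latents.
    latents = zlib.compress(bytes(rows + cols))
    return {"shape": (m, n), "blocks": blocks, "latents": latents}


def decompress(code):
    """Invert compress(): recover the latents, then refill each block."""
    m, n = code["shape"]
    labels = zlib.decompress(code["latents"])
    rows, cols = list(labels[:m]), list(labels[m:])
    table = [[0] * n for _ in range(m)]
    for (r, c), payload in code["blocks"].items():
        data = iter(zlib.decompress(payload))
        for i in range(m):
            if rows[i] != r:
                continue
            for j in range(n):
                if cols[j] == c:
                    table[i][j] = next(data)
    return table
```

Because the decoder reads the latents first and then refills the blocks in the same row-major order, `decompress(compress(table))` returns the original table for any latent assignment; the quality of the latent estimates affects only the compressed size, not correctness.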

Taxonomy

Topics: Algorithms and Data Compression · Advanced Data Compression Techniques · Computability, Logic, AI Algorithms