Pcodec: Better Compression for Numerical Sequences
Martin Loncaric, Niels Jeppesen, Ben Zinberg

TL;DR
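Pcodec (Pco) is a lossless compression format and algorithm for numerical sequences (floats or integers). Its novel binning algorithm converges quickly to the true entropy of smoothly, independently, and identically distributed integers, and it achieves higher compression ratios than other numerical codecs on real-world datasets.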
Contribution
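The paper introduces Pcodec, which combines a novel binning algorithm with two preprocessing steps (a mode that decomposes numbers into integer latent variables, and delta encoding) to achieve higher compression ratios on numerical data than other numerical codecs.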
Findings
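Pcodec achieves 29-94% higher compression ratios than other numerical codecs on six real-world columnar datasets.
Its binning algorithm converges to the true entropy of smoothly, independently, and identically distributed (SIID) integers, and the paper proves this convergence with a practical bound.
It achieves these ratios while using less compression time than the codecs it is compared against.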
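To make the idea of a binned code approaching the true entropy concrete, here is a minimal sketch. It is not Pco's actual binning algorithm: the fixed-width bins, the cost model (an entropy-coded bin index plus a uniformly coded offset inside the bin), and the helper names `true_entropy_bits` and `binned_cost_bits` are all assumptions made for illustration.

```python
import math
from collections import Counter


def true_entropy_bits(xs):
    """Empirical Shannon entropy of the values, in bits per value."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())


def binned_cost_bits(xs, bin_width):
    """Average bits per value for a (bin index, offset) code.

    Assumption for this sketch: the bin index costs its empirical entropy,
    and the offset inside the bin costs log2(bin_width) bits, i.e. values
    are treated as uniformly distributed within each bin.
    """
    n = len(xs)
    bin_counts = Counter(x // bin_width for x in xs)
    index_bits = -sum(c / n * math.log2(c / n) for c in bin_counts.values())
    return index_bits + math.log2(bin_width)


# A smooth (linearly increasing) distribution over the integers 0..255:
# value v appears v + 1 times.
xs = [v for v in range(256) for _ in range(v + 1)]

entropy = true_entropy_bits(xs)
overhead_16 = binned_cost_bits(xs, 16) - entropy  # 16 bins of width 16
overhead_1 = binned_cost_bits(xs, 1) - entropy    # one bin per value
print(f"entropy: {entropy:.3f} bits, overhead with 16 bins: {overhead_16:.4f} bits")
```

Because the distribution is smooth, it is nearly uniform inside each bin, so 16 bins are enough to get very close to the entropy that 256 separate symbols would achieve. A less smooth distribution would pay a larger overhead under the same bins.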
Abstract
We present Pcodec (Pco), a format and algorithm for losslessly compressing numerical (float or integer) sequences. Pco's core and most novel component is a binning algorithm that quickly converges to the true entropy of smoothly, independently, and identically distributed (SIID) integers. We mathematically prove this convergence with a practical bound. To accommodate data that is not SIID, Pco has two opinionated preprocessing steps. The first step, Pco's mode, decomposes the numbers into more smoothly distributed integer latent variables. The second step, delta encoding, makes the latents more independently and identically distributed. We demonstrate that Pco achieves 29-94% higher compression ratio than other numerical codecs on six real-world columnar datasets while using less compression time.
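The effect of the delta encoding step can be illustrated with a short sketch. This is generic first-order delta encoding, not Pco's implementation; the function names and the synthetic input sequence are assumptions chosen for the example.

```python
import math
from collections import Counter


def delta_encode(xs):
    """Keep the first value, then store each consecutive difference."""
    if not xs:
        return []
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]


def delta_decode(ds):
    """Invert delta_encode with a running sum."""
    out = []
    total = 0
    for d in ds:
        total += d
        out.append(total)
    return out


def empirical_entropy_bits(xs):
    """Empirical Shannon entropy of the values, in bits per value."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())


# A trending sequence: every value is distinct, so the values themselves
# are far from identically distributed.
values = [1000 + 3 * i + (i % 2) for i in range(1000)]
deltas = delta_encode(values)

# The transform is lossless.
assert delta_decode(deltas) == values

raw_bits = empirical_entropy_bits(values)
delta_bits = empirical_entropy_bits(deltas[1:])  # skip the stored first value
print(f"raw: {raw_bits:.2f} bits/value, deltas: {delta_bits:.2f} bits/value")
```

The raw values never repeat, while the differences take only two distinct values, so a coder that models each number independently needs far fewer bits per delta than per raw value.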
Taxonomy
Topics: Numerical Methods and Algorithms · Digital Filter Design and Implementation · Parallel Computing and Optimization Techniques
