Efficient terabyte-scale text compression via stable local consistency and parallel grammar processing
Diego Diaz-Dominguez

TL;DR
This paper introduces a parallelizable text compression algorithm for terabyte-scale datasets using stable local consistency in grammars, enabling efficient compression of massive data like bacterial genomes.
Contribution
The author proposes stable local consistency, a novel concept that allows fully parallel grammar-based compression without synchronization, improving scalability for large datasets.
Findings
Processed 7.9 TB of bacterial genomes in 9 hours
Achieved 85-fold compression ratio
Used 16 threads and 0.43 bits per symbol of working memory
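To put the figures above in absolute terms, the following back-of-the-envelope sketch converts them into sizes and rates. It assumes decimal terabytes (10^12 bytes), one byte per input symbol, and that the bits-per-symbol figure is measured against the input length; none of these conventions is stated in this summary, and the function name is hypothetical.

```python
def back_of_envelope(input_tb=7.9, ratio=85.0, hours=9.0, bits_per_symbol=0.43):
    """Derive rough absolute numbers from the reported findings.

    Assumptions (not stated in the summary): 1 TB = 10**12 bytes,
    one symbol = one byte, memory is relative to the input length.
    """
    input_bytes = input_tb * 10**12
    compressed_bytes = input_bytes / ratio
    # bits per symbol -> bytes of memory for the whole input
    memory_bytes = input_bytes * bits_per_symbol / 8
    throughput = input_bytes / (hours * 3600)  # bytes per second
    return {
        "compressed_gb": compressed_bytes / 10**9,
        "memory_gb": memory_bytes / 10**9,
        "throughput_mb_s": throughput / 10**6,
    }


estimates = back_of_envelope()
for name, value in estimates.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions the compressed output is on the order of 93 GB, the memory footprint roughly 425 GB, and the sustained throughput about 244 MB/s across the 16 threads.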
Abstract
We present a highly parallelizable text compression algorithm that scales efficiently to terabyte-sized datasets. Our method builds on locally consistent grammars, a lightweight form of compression, combined with simple recompression techniques to achieve further space reductions. Locally consistent grammar algorithms are particularly suitable for scaling, as they need minimal satellite information to compact the text. We introduce a novel concept to enable parallelisation, stable local consistency. A grammar algorithm ALG is stable if, for any pattern $P$ occurring in a collection $\mathcal{T} = \{T_1, T_2, \ldots, T_k\}$, the instances $ALG(T_1), ALG(T_2), \ldots, ALG(T_k)$ independently produce cores for $P$ with the same topology. In a locally consistent grammar, the core of $P$ is a subset of nodes and edges in $\mathcal{T}$'s parse tree that remains the same in all the occurrences…
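The idea of a core that is reproduced independently in different strings can be illustrated with a toy parser. This is a minimal sketch, not the paper's algorithm: it assumes phrase boundaries are placed at strict local minima of a fixed per-symbol hash, and that nonterminals are named by a content fingerprint (hash collisions are ignored). The functions `fingerprint`, `break_points`, and `core` are hypothetical names introduced here.

```python
import hashlib
from typing import List


def fingerprint(s: str) -> int:
    """64-bit fingerprint that depends only on the content of s."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def break_points(text: str) -> List[int]:
    """Positions whose symbol hash is a strict local minimum.

    The decision at position i looks only at text[i-1], text[i] and
    text[i+1], so it is the same wherever that context appears.
    """
    h = [fingerprint(c) for c in text]
    return [i for i in range(1, len(text) - 1)
            if h[i] < h[i - 1] and h[i] < h[i + 1]]


def core(text: str, pattern: str) -> List[int]:
    """Nonterminal names of the phrases lying strictly inside the
    first occurrence of pattern in text (raises ValueError if absent)."""
    start = text.index(pattern)
    end = start + len(pattern)
    # Keep only breaks whose three-symbol context is inside the pattern.
    cuts = [b for b in break_points(text) if start < b < end - 1]
    # Each phrase is named by its content, not by a per-string counter.
    return [fingerprint(text[a:b]) for a, b in zip(cuts, cuts[1:])]


# Two strings processed independently, sharing one pattern.
pattern = "GATTACAGATTACCA"
t1 = "CCCC" + pattern + "TTG"
t2 = "A" + pattern + "GGGGGG"
same_core = core(t1, pattern) == core(t2, pattern)
print("cores agree:", same_core)
```

Because the names are derived from phrase content, the two runs need no shared symbol table to agree on the pattern's interior phrases. Only the phrases touching the pattern's borders may differ, since they depend on the surrounding text.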
Taxonomy
Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Speech Recognition and Synthesis
