Efficient terabyte-scale text compression via stable local consistency and parallel grammar processing

Diego Diaz-Dominguez

arXiv:2411.12439·cs.DS·February 26, 2025

TL;DR

This paper introduces a parallelizable text compression algorithm for terabyte-scale datasets using stable local consistency in grammars, enabling efficient compression of massive data like bacterial genomes.

Contribution

The paper proposes a novel concept, stable local consistency, that allows fully parallel grammar-based compression without synchronization, improving scalability for large datasets.

Findings

01

Processed 7.9 TB of bacterial genomes in 9 hours

02

Achieved 85-fold compression ratio

03

Used 16 threads and 0.43 bits per symbol of working memory
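
As a rough sanity check on these figures, the sketch below derives the output size, throughput, and working memory they imply. It assumes decimal terabytes (10^12 bytes) and one byte per symbol, neither of which is stated on this page, and the function name `implied_figures` is purely illustrative.

```python
def implied_figures(input_tb=7.9, ratio=85.0, hours=9.0, bits_per_symbol=0.43):
    """Back-of-the-envelope numbers implied by the reported findings.

    Assumptions (not stated in the source): 1 TB = 10**12 bytes and
    one input symbol occupies one byte.
    """
    input_bytes = input_tb * 1e12
    return {
        # Size of the compressed output, in gigabytes.
        "compressed_gb": input_bytes / ratio / 1e9,
        # Average input consumed per second across all threads, in megabytes.
        "throughput_mb_s": input_bytes / (hours * 3600) / 1e6,
        # Working memory at the reported bits-per-symbol rate, in gigabytes.
        "working_memory_gb": input_bytes * bits_per_symbol / 8 / 1e9,
    }


if __name__ == "__main__":
    for name, value in implied_figures().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the compressed output would be roughly 93 GB, the aggregate throughput about 244 MB/s, and the working memory around 425 GB. The exact values depend on the unit conventions and the precise running time used in the paper.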

Abstract

We present a highly parallelizable text compression algorithm that scales efficiently to terabyte-sized datasets. Our method builds on locally consistent grammars, a lightweight form of compression, combined with simple recompression techniques to achieve further space reductions. Locally consistent grammar algorithms are particularly suitable for scaling, as they need minimal satellite information to compact the text. We introduce a novel concept to enable parallelisation, stable local consistency. A grammar algorithm ALG is stable, if for any pattern $P$ occurring in a collection $T = \{T_{1}, T_{2}, \dots, T_{k}\}$, the instances $ALG(T_{1}), ALG(T_{2}), \dots, ALG(T_{k})$ independently produce cores for $P$ with the same topology. In a locally consistent grammar, the core of $P$ is a subset of nodes and edges in $T$'s parse tree that remains the same in all the occurrences…
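
The idea of local consistency can be illustrated with a toy parser. The sketch below is not the paper's algorithm: it assumes a simple rule (place a phrase break wherever a seeded hash of the symbol is a strict local minimum) and uses hypothetical helper names (`symbol_rank`, `breaks`, `core_breaks`). It shows that two independent parses sharing only a seed place identical breaks inside a common pattern, which is the flavour of the stability property described in the abstract.

```python
import hashlib


def symbol_rank(sym, seed=0):
    """Deterministic pseudo-random rank of a symbol; the same seed gives the same rank."""
    digest = hashlib.blake2b(f"{seed}:{sym}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def breaks(text, seed=0):
    """Positions whose rank is a strict local minimum among their two neighbours."""
    ranks = [symbol_rank(c, seed) for c in text]
    return [
        i
        for i in range(1, len(text) - 1)
        if ranks[i] < ranks[i - 1] and ranks[i] < ranks[i + 1]
    ]


def parse_phrases(text, seed=0):
    """Cut the text into phrases at the break positions."""
    cuts = [0] + breaks(text, seed) + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]


def core_breaks(text, pattern, seed=0):
    """Breaks inside the first occurrence of `pattern`, relative to its start.

    Only positions whose two neighbours also lie inside the pattern are kept,
    so the result cannot depend on the surrounding context.
    """
    at = text.find(pattern)
    if at < 0:
        raise ValueError("pattern does not occur in text")
    lo, hi = at + 1, at + len(pattern) - 2
    return [b - at for b in breaks(text, seed) if lo <= b <= hi]


if __name__ == "__main__":
    pattern = "acgtacgtgattaca"
    t1 = "xx" + pattern + "yy"
    t2 = "ggg" + pattern + "ttt"
    # Each text is parsed independently; only the seed is shared.
    print(core_breaks(t1, pattern), core_breaks(t2, pattern))
    print(parse_phrases(t1), parse_phrases(t2))
```

Because each break decision looks only at a symbol and its immediate neighbours, the breaks strictly inside the pattern are the same in both texts, while phrases touching the pattern's boundary may differ. A real locally consistent grammar applies such a rule over several rounds and to nonterminals as well, which this sketch omits.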

Code & Models

Repositories

ddiazdom/lcg
Official

Taxonomy

Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Speech Recognition and Synthesis