Practical and Effective Re-Pair Compression
Philip Bille, Inge Li Gørtz, Nicola Prezza

TL;DR
This paper presents a practical implementation of Re-Pair compression that reduces working space, and introduces a linear-time heuristic for encoding the output grammar. On real datasets, the encoded grammar is within 2.8% of the information-theoretic minimum.
Contribution
The authors develop a practical, space-efficient Re-Pair implementation and a linear-time heuristic for grammar encoding that approaches the information-theoretic lower bound.
Findings
Working space for Re-Pair is reduced to (1.5+ε)n words, text included, for a small constant ε.
The grammar encoding uses only 2.8% more bits than the information-theoretic minimum.
In half of the tested cases, the compressor produces smaller output than 7-Zip.
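To put the 2.8% figure in context, the sketch below evaluates the lower bound the paper's abstract gives for a grammar with d rules, log(d!) + 2d bits, and compares it with the closed-form approximation d log d + 0.557d. The function names are hypothetical and chosen only for illustration; this is a piece of arithmetic, not part of the authors' tool.

```python
import math


def grammar_lower_bound_bits(d):
    """Lower bound log2(d!) + 2d in bits (function name is hypothetical).

    log2(d!) is computed through lgamma(d + 1) so that large d does not
    require building the factorial as a huge integer.
    """
    return math.lgamma(d + 1) / math.log(2) + 2 * d


def approx_lower_bound_bits(d):
    """Closed-form approximation d*log2(d) + 0.557*d in bits."""
    return d * math.log2(d) + 0.557 * d


if __name__ == "__main__":
    for d in (10**3, 10**6):
        exact = grammar_lower_bound_bits(d)
        approx = approx_lower_bound_bits(d)
        # An encoding 2.8% above the bound would use about 1.028 * exact bits.
        print(d, round(exact), round(approx), round(1.028 * exact))
```

The 0.557 constant comes from Stirling's approximation: log2(d!) is roughly d log2 d minus d log2 e, and 2 minus log2 e is about 0.557.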
Abstract
Re-Pair is an efficient grammar compressor that operates by recursively replacing high-frequency character pairs with new grammar symbols. The most space-efficient linear-time algorithm computing Re-Pair uses (1+ε)n + √n words on top of the re-writable text (of length n and stored in n words), for any constant ε > 0; in practice, however, this solution uses complex sub-procedures preventing it from being practical. In this paper, we present an implementation of the above-mentioned result making use of more practical solutions; our tool further improves the working space to (1.5+ε)n words (text included), for some small constant ε. As a second contribution, we focus on compact representations of the output grammar. The lower bound for storing a grammar with d rules is log(d!) + 2d ≈ d log d + 0.557d bits, and the most efficient encoding…
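The core idea in the abstract's first sentence, repeatedly replacing the most frequent pair with a new grammar symbol, can be sketched in a few lines. This is a deliberately naive version that recounts all pairs on every round, so it is quadratic in the worst case and uses far more space than the linear-time, space-efficient algorithm the paper implements. The names `repair`, `expand`, and the choice of integers as new symbols are assumptions made for illustration.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent pair."""
    counts = Counter()
    last_pos = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # A pair like (a, a) overlaps itself in "aaa"; skip the overlap.
        if pair[0] == pair[1] and last_pos.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def replace_pair(seq, pair, symbol):
    """Replace occurrences of pair with symbol, scanning left to right."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair(text):
    """Naive Re-Pair sketch: returns (final sequence, rules).

    Terminals are characters; new symbols are integers 0, 1, 2, ...
    Each rule maps a new symbol to the pair it replaced.
    """
    seq = list(text)
    rules = {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:  # no pair repeats, nothing left to gain
            break
        symbol = len(rules)
        rules[symbol] = pair
        seq = replace_pair(seq, pair, symbol)
    return seq, rules


def expand(seq, rules):
    """Invert repair() by expanding every rule back to characters."""
    out = []
    stack = list(reversed(seq))
    while stack:
        s = stack.pop()
        if s in rules:
            left, right = rules[s]
            stack.append(right)
            stack.append(left)
        else:
            out.append(s)
    return "".join(out)
```

Calling `repair("abracadabra abracadabra")` yields a shorter sequence plus a rule set, and `expand` on that result reproduces the input. The number of rules in that set is the d that the grammar-size bounds refer to.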
Taxonomy
Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Network Packet Processing and Optimization
