Grammar Compression By Induced Suffix Sorting
Daniel S. N. Nunes, Felipe A. Louza, Simon Gog, Mauricio Ayala-Rincón, Gonzalo Navarro

TL;DR
GCIS is a grammar compression algorithm based on induced suffix sorting that offers low space and time for compression, making it competitive with standard compressors and suitable for direct substring access in compressed form.
Contribution
This work introduces GCIS, a novel grammar compression method leveraging induced suffix sorting, providing efficient compression with practical advantages for substring access.
Findings
GCIS achieves competitive compression ratios.
GCIS requires little space and time for compression.
GCIS is effective for large and repetitive texts.
Abstract
A grammar compression algorithm, called GCIS, is introduced in this work. GCIS is based on the induced suffix sorting algorithm SAIS, presented by Nong et al. in 2009. The proposed solution builds on the factorization performed by SAIS during suffix sorting. A context-free grammar is used to replace factors by non-terminals. The algorithm is then recursively applied to the shorter sequence of non-terminals. The resulting grammar is encoded by exploiting some redundancies, such as common prefixes between the right-hand sides of rules, sorted according to SAIS. GCIS stands out for the low space and time it requires for compression while obtaining competitive compression ratios. Our experiments on regular and repetitive, moderate and very large texts show that GCIS stands as a very convenient choice compared to well-known compressors such as Gzip, 7-Zip, and RePair, the gold standard in grammar…
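The factor-and-recurse idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the factors are obtained by cutting the sequence at LMS (leftmost S-type) positions as classified in SAIS, and it names the distinct factors by plain lexicographic sorting as a stand-in for the order that induced sorting would produce. The function names (`lms_positions`, `one_round`, `compress`, `decompress`) are hypothetical, and the rule encoding step (for example, sharing common prefixes between right-hand sides) is left out.

```python
def lms_positions(seq):
    """Return the start indices of LMS (leftmost S-type) suffixes of seq."""
    n = len(seq)
    if n == 0:
        return []
    # is_s[i] is True when suffix i is S-type. The last position is treated
    # as L-type, as if a smallest sentinel followed it (assumption: no
    # explicit sentinel is appended in this sketch).
    is_s = [False] * n
    for i in range(n - 2, -1, -1):
        if seq[i] < seq[i + 1]:
            is_s[i] = True
        elif seq[i] == seq[i + 1]:
            is_s[i] = is_s[i + 1]
    # An LMS position is S-type and directly preceded by an L-type position.
    return [i for i in range(1, n) if is_s[i] and not is_s[i - 1]]


def one_round(seq):
    """Cut seq at LMS positions and replace each factor by a rule index."""
    cuts = [0] + lms_positions(seq) + [len(seq)]
    factors = [tuple(seq[a:b]) for a, b in zip(cuts, cuts[1:])]
    # Stand-in ordering: plain sort of the distinct factors (assumption).
    rules = sorted(set(factors))
    name = {rhs: k for k, rhs in enumerate(rules)}
    return rules, [name[f] for f in factors]


def compress(text):
    """Apply one_round repeatedly; return the per-level rules and the root."""
    seq = [ord(c) for c in text]
    levels = []
    while len(seq) > 1:
        rules, seq = one_round(seq)
        levels.append(rules)
    return levels, seq


def decompress(levels, seq):
    """Expand rule indices level by level, from the last level to the first."""
    for rules in reversed(levels):
        seq = [sym for x in seq for sym in rules[x]]
    return "".join(chr(c) for c in seq)
```

Calling `levels, root = compress("banana")` and then `decompress(levels, root)` returns the original string. Each round strictly shortens the sequence, because LMS positions are never adjacent and never fall on the last position, so the loop terminates.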
Taxonomy
Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Network Packet Processing and Optimization
