Computing all-vs-all MEMs in grammar-compressed text
Diego Diaz-Dominguez, Leena Salmela

TL;DR
This paper introduces a compression-aware method for computing all-vs-all maximal exact matches (MEMs) in repetitive text collections. It builds a fix-free grammar from the text and then computes the MEMs incrementally over that grammar, without decompressing nonterminals.
Contribution
It presents a novel fix-free grammar construction from the text and an incremental MEM algorithm that operates directly on the grammar, improving efficiency in repetitive collections.
Findings
Builds the grammar from the text in linear time and space
Computes the MEMs in O(G + occ) time, where G is the grammar size and occ is the number of MEMs
Uses O(log G (G + occ)) bits of memory
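To make the quantity occ concrete, the following is a minimal brute-force sketch in Python that enumerates the MEMs between two plain strings. It is not the paper's algorithm: it works on uncompressed text in quadratic time or worse, and the function name `brute_force_mems` and the `min_len` parameter are hypothetical, chosen only for illustration. A match is reported as a triple (i, j, l), meaning x[i:i+l] equals y[j:j+l] and the match cannot be extended to the left or to the right.

```python
def brute_force_mems(x, y, min_len=1):
    """Enumerate maximal exact matches between x and y as (i, j, length).

    Illustrative only: a naive scan over all start pairs, not the
    grammar-based method described in the paper.
    """
    mems = []
    for i in range(len(x)):
        for j in range(len(y)):
            # Skip start pairs that could be extended to the left:
            # they belong to a longer match that starts earlier.
            if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
                continue
            # Extend to the right as far as the characters agree,
            # which makes the match right-maximal.
            length = 0
            while (i + length < len(x) and j + length < len(y)
                   and x[i + length] == y[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


if __name__ == "__main__":
    print(brute_force_mems("xabcy", "zabcw"))       # [(1, 1, 3)]
    print(brute_force_mems("abab", "ab", min_len=2))  # [(0, 0, 2), (2, 0, 2)]
```

In this sketch, occ for a pair of strings is simply the length of the returned list. The all-vs-all setting repeats this over every pair of strings in the collection.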
Abstract
We describe a compression-aware method to compute all-vs-all maximal exact matches (MEMs) among strings of a repetitive collection $\mathcal{T}$. The key concept in our work is the construction of a fully-balanced grammar $\mathcal{G}$ from $\mathcal{T}$ that meets a property that we call \emph{fix-free}: the expansions of the nonterminals that have the same height in the parse tree form a fix-free set (i.e., prefix-free and suffix-free). The fix-free property allows us to compute the MEMs of $\mathcal{T}$ incrementally over $\mathcal{G}$ using a standard suffix-tree-based MEM algorithm, which runs on a subset of grammar rules at a time and does not decompress nonterminals. By modifying the locally-consistent grammar of Christiansen et al. (2020), we show how we can build $\mathcal{G}$ from $\mathcal{T}$ in linear time and space. We also demonstrate that our MEM algorithm runs on top of…
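The fix-free condition in the abstract can be illustrated with a small check on a set of strings. The sketch below is an assumption-laden illustration, not code from the paper: the function name `is_fix_free` is hypothetical, and it tests the property on an arbitrary set of strings, whereas the paper applies it to the expansions of nonterminals at the same height of the parse tree.

```python
from itertools import permutations


def is_fix_free(strings):
    """Return True if no string is a prefix or a suffix of a different one.

    Illustrative helper: a set is fix-free when it is both prefix-free
    and suffix-free.
    """
    items = set(strings)  # duplicates are irrelevant for the property
    for a, b in permutations(items, 2):
        # a and b are distinct, so any hit is a proper prefix or suffix.
        if b.startswith(a) or b.endswith(a):
            return False
    return True


if __name__ == "__main__":
    print(is_fix_free({"ab", "ba", "cc"}))  # True
    print(is_fix_free({"ab", "abc"}))       # False: "ab" is a prefix of "abc"
    print(is_fix_free({"bc", "abc"}))       # False: "bc" is a suffix of "abc"
```

The pairwise comparison is quadratic in the number of strings, which is fine for a demonstration; a trie over the strings and their reverses would be the usual choice for larger sets.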
Taxonomy
Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Network Packet Processing and Optimization
