Computing all-vs-all MEMs in grammar-compressed text

Diego Diaz-Dominguez; Leena Salmela

arXiv:2306.16815 · cs.IR · June 30, 2023

TL;DR

This paper introduces a compression-aware method for computing all-vs-all maximal exact matches (MEMs) in repetitive text collections. It builds a grammar with a fix-free property, which lets MEMs be computed incrementally over the grammar without decompressing nonterminals.
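To make the term concrete, the following is a minimal brute-force sketch of what an all-vs-all MEM computation reports. It is not the paper's method: it works on plain, uncompressed strings and takes quadratic time per pair. The function names `pairwise_mems` and `all_vs_all_mems`, the `min_len` parameter, and the `(i, j, length)` output convention are assumptions made for illustration.

```python
from itertools import combinations


def pairwise_mems(x, y, min_len=1):
    """Brute-force MEMs between strings x and y.

    Returns triples (i, j, length) such that x[i:i+length] == y[j:j+length]
    and the match cannot be extended to the left or to the right.
    """
    mems = []
    for i in range(len(x)):
        for j in range(len(y)):
            if x[i] != y[j]:
                continue
            # Skip starts that are not left-maximal: the match could be
            # extended one symbol to the left.
            if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
                continue
            # Extend to the right as far as the symbols agree, which makes
            # the match right-maximal by construction.
            length = 0
            while (i + length < len(x) and j + length < len(y)
                   and x[i + length] == y[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


def all_vs_all_mems(collection, min_len=1):
    """Map each pair of string indices (a, b), a < b, to its list of MEMs."""
    result = {}
    for (a, x), (b, y) in combinations(enumerate(collection), 2):
        result[(a, b)] = pairwise_mems(x, y, min_len)
    return result


if __name__ == "__main__":
    print(all_vs_all_mems(["abcab", "cabc"]))
```

On the two-string collection in the usage line, the sketch reports the matches `abc` and `cab`, each of length 3. The cost of this direct approach on large repetitive collections is the motivation for working on a compressed representation instead.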

Contribution

It presents a novel fix-free grammar construction from the text and an incremental MEM algorithm that operates directly on the grammar, improving efficiency in repetitive collections.

Findings

1. Builds the grammar in linear time and space.

2. Computes MEMs in O(G + occ) time, where G is the grammar size and occ is the number of MEMs.

3. Uses O(log G (G + occ)) bits of memory.

Abstract

We describe a compression-aware method to compute all-vs-all maximal exact matches (MEMs) among strings of a repetitive collection $T$. The key concept in our work is the construction of a fully-balanced grammar $G$ from $T$ that meets a property that we call \emph{fix-free}: the expansions of the nonterminals that have the same height in the parse tree form a fix-free set (i.e., prefix-free and suffix-free). The fix-free property allows us to compute the MEMs of $T$ incrementally over $G$ using a standard suffix-tree-based MEM algorithm, which runs on a subset of grammar rules at a time and does not decompress nonterminals. By modifying the locally-consistent grammar of Christiansen et al. (2020), we show how we can build $G$ from $T$ in linear time and space. We also demonstrate that our MEM algorithm runs on top of…
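The fix-free condition from the abstract can be illustrated with a small check: a set of strings is fix-free when no string in it is a prefix or a suffix of a different string in the set. The sketch below tests that condition on explicit strings. It assumes the expansions of same-height nonterminals have already been written out, which the paper's method is designed to avoid, and the name `is_fix_free` is hypothetical.

```python
from itertools import permutations


def is_fix_free(strings):
    """Return True if the distinct strings form a prefix-free and suffix-free set."""
    distinct = set(strings)
    # Check every ordered pair (a, b) of different strings.
    for a, b in permutations(distinct, 2):
        if b.startswith(a) or b.endswith(a):
            return False
    return True


if __name__ == "__main__":
    print(is_fix_free(["abc", "bcd", "cda"]))  # no string is a prefix or suffix of another
    print(is_fix_free(["ab", "abc"]))          # "ab" is a prefix of "abc"
    print(is_fix_free(["bc", "abc"]))          # "bc" is a suffix of "abc"
```

This pairwise check is quadratic in the number of strings and is meant only to pin down the definition, not to suggest how the grammar construction enforces it.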

Taxonomy

Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Network Packet Processing and Optimization