Computing Matching Statistics on Repetitive Texts
Younan Gao

TL;DR
This paper introduces data structures for computing matching statistics on highly repetitive texts, with space bounded in terms of repetitiveness measures such as the size of the smallest string attractor.
Contribution
The paper presents three data structures, similar to LZ-compressed indexes, for computing matching statistics on repetitive texts; their space is expressed in terms of repetitiveness measures of the text.
Findings
The space of all three data structures is bounded in terms of the size of the smallest string attractor and a second repetitiveness measure.
They enable efficient computation of matching statistics on repetitive texts.
The methods improve upon previous approaches in handling highly repetitive data.
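As an illustration of what a repetitiveness measure looks like, the sketch below computes the substring-complexity measure commonly written δ, defined as the maximum over k of (number of distinct length-k substrings of the text) / k. This is a brute-force reference under that standard definition, not one of the paper's data structures; the function name `substring_complexity` is a hypothetical choice for this example.

```python
def substring_complexity(text: str) -> float:
    """Brute-force delta: max over k of (distinct substrings of length k) / k.

    Illustrative only; runs in roughly cubic time and is meant for short strings.
    """
    n = len(text)
    best = 0.0
    for k in range(1, n + 1):
        # Collect every distinct substring of length k.
        distinct = len({text[i:i + k] for i in range(n - k + 1)})
        best = max(best, distinct / k)
    return best


if __name__ == "__main__":
    # A highly repetitive string keeps the measure small relative to its length.
    print(substring_complexity("abababababab"))
    print(substring_complexity("abcdefghijkl"))
```

The repetitive string has only two distinct substrings at every length below its own, so the measure stays at 2, while the string of twelve distinct characters reaches 12.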
Abstract
Computing the {\em matching statistics} of a string with respect to a text is a fundamental problem which has application to genome sequence comparison. In this paper, we study the problem of computing the matching statistics upon highly repetitive texts. We design three different data structures that are similar to LZ-compressed indexes. The space costs of all of them can be measured by $\gamma$, the size of the smallest string attractor [STOC'2018] and $\delta$, a better measure of repetitiveness [LATIN'2020].
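To make the problem concrete, here is a minimal sketch of matching statistics under the usual definition: for each position i of a pattern, the length of the longest prefix of the pattern's suffix starting at i that occurs somewhere in the text. This is a naive quadratic-or-worse reference that stores the text uncompressed, not the paper's compressed data structures; the function name `matching_statistics` is a hypothetical choice, and only the lengths are reported (some formulations also return a matching position in the text).

```python
def matching_statistics(pattern: str, text: str) -> list:
    """Naive reference: ms[i] is the length of the longest prefix of
    pattern[i:] that occurs as a substring of text."""
    ms = []
    for i in range(len(pattern)):
        length = 0
        # Extend the match one character at a time while it still occurs in text.
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


if __name__ == "__main__":
    print(matching_statistics("abba", "abab"))  # [2, 1, 2, 1]
```

In the example, "ab" occurs in the text but "abb" does not, so the first entry is 2; the second entry is 1 because "bb" never occurs.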
Taxonomy
Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Genomics and Phylogenetic Studies
