Computing Matching Statistics on Repetitive Texts

Younan Gao

arXiv:2111.00376·cs.DS·January 14, 2022

TL;DR

This paper introduces data structures for computing matching statistics on highly repetitive texts, with space costs measured by two repetitiveness measures: γ, the size of the smallest string attractor, and δ.

Contribution

The paper presents three data structures, similar to LZ-compressed indexes, for computing matching statistics on highly repetitive texts. The space cost of each is expressed in terms of the repetitiveness measures γ and δ.

Findings

01

The space costs of all three data structures can be measured by γ, the size of the smallest string attractor, and δ.

02

They enable efficient computation of matching statistics on repetitive texts.

03

The methods improve upon previous approaches in handling highly repetitive data.

Abstract

Computing the matching statistics of a string $P[1..m]$ with respect to a text $T[1..n]$ is a fundamental problem with applications to genome sequence comparison. In this paper, we study the problem of computing matching statistics on highly repetitive texts. We design three different data structures that are similar to LZ-compressed indexes. The space costs of all of them can be measured by $\gamma$, the size of the smallest string attractor [STOC'2018], and $\delta$, a better measure of repetitiveness [LATIN'2020].
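
To make the problem statement concrete, here is a minimal sketch that assumes the standard definition: for each position i of P, the matching statistic is the length of the longest prefix of P[i..m] that occurs somewhere in T. The function name `matching_statistics` and the 0-based indexing are illustrative choices, not taken from the paper. This is a naive baseline on the uncompressed text and is not one of the paper's three data structures.

```python
from typing import List


def matching_statistics(pattern: str, text: str) -> List[int]:
    """Naive matching statistics (illustrative baseline, not the paper's method).

    ms[i] is the length of the longest prefix of pattern[i:] that occurs
    as a substring of text. Indices are 0-based here, whereas the abstract
    writes P[1..m] and T[1..n].
    """
    m = len(pattern)
    ms = [0] * m
    length = 0
    for i in range(m):
        # If pattern[i-1 : i-1+k] occurs in text, then pattern[i : i-1+k]
        # also occurs, so ms[i] >= ms[i-1] - 1 and we can resume from there.
        length = max(length - 1, 0)
        # Extend the match one character at a time while it still occurs.
        while i + length < m and pattern[i:i + length + 1] in text:
            length += 1
        ms[i] = length
    return ms


if __name__ == "__main__":
    print(matching_statistics("nabana", "banana"))
```

Each substring test scans the whole text, so this sketch is only practical for small inputs. Compressed indexes of the kind the abstract describes aim to answer the same queries in space that depends on the repetitiveness of T rather than on n.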

Taxonomy

Topics: Algorithms and Data Compression · Natural Language Processing Techniques · Genomics and Phylogenetic Studies