Computing matching statistics on Wheeler DFAs

Alessio Conte; Nicola Cotumaccio; Travis Gagie; Giovanni Manzini; Nicola Prezza; Marinella Sciortino

arXiv:2301.05338 · cs.DS · January 16, 2023

Open Access

TL;DR

This paper generalizes an efficient matching statistics algorithm from strings to Wheeler automata and introduces a notion of LCP array for these automata, a first step towards extending suffix tree functionalities to labeled graphs.

Contribution

It extends the matching statistics algorithm from strings to Wheeler automata and introduces an LCP array for these automata, enabling suffix tree-like operations on labeled graphs.

Findings

01. Generalized matching statistics computation from strings to Wheeler automata

02. Introduced a notion of LCP array for Wheeler automata

03. Took a first step towards extending suffix tree functionalities to labeled graphs

Abstract

Matching statistics were introduced to solve the approximate string matching problem, which is a recurrent subroutine in bioinformatics applications. In 2010, Ohlebusch et al. [SPIRE 2010] proposed a time- and space-efficient algorithm for computing matching statistics which relies on some components of a compressed suffix tree, notably the longest common prefix (LCP) array. In this paper, we show how their algorithm can be generalized from strings to Wheeler deterministic finite automata. Most importantly, we introduce a notion of LCP array for Wheeler automata, thus establishing a first clear step towards extending (compressed) suffix tree functionalities to labeled graphs.
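
To make the two central objects concrete, here is a minimal sketch on plain strings only. It assumes the standard definitions: the matching statistics of a pattern against a text record, for each pattern position i, the length of the longest prefix of pattern[i:] that occurs in the text, and the LCP array stores the longest common prefix length of lexicographically adjacent suffixes. The function names are hypothetical, and the code is a naive illustration, not the algorithm of Ohlebusch et al. or the paper's generalization to Wheeler DFAs.

```python
def matching_statistics(pattern, text):
    """Naive matching statistics (illustrative only, not the paper's method).

    ms[i] is the length of the longest prefix of pattern[i:] that occurs
    somewhere in text.
    """
    ms = []
    for i in range(len(pattern)):
        length = 0
        # Extend the match one character at a time while it still occurs.
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


def suffix_array(text):
    """Start positions of the suffixes of text in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text):
    """lcp[k] is the longest common prefix length of the suffixes ranked
    k-1 and k in the suffix array; lcp[0] is 0 by convention."""
    sa = suffix_array(text)
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = text[sa[k - 1]:], text[sa[k]:]
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        lcp[k] = n
    return lcp


if __name__ == "__main__":
    print(matching_statistics("anab", "banana"))
    print(suffix_array("banana"))
    print(lcp_array("banana"))
```

Both functions do far more work than necessary for long inputs; they are meant only to pin down what is being computed. Efficient algorithms avoid re-scanning the text by working on an index over it, which is where the LCP array comes in.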

Taxonomy

Topics: Algorithms and Data Compression · Network Packet Processing and Optimization · DNA and Biological Computing