Efficient Index for Weighted Sequences

Carl Barton; Tomasz Kociumaka; Solon P. Pissis; Jakub Radoszewski

arXiv:1602.01116 · cs.DS · February 4, 2016

TL;DR

This paper introduces an index for weighted sequences that answers pattern matching queries in optimal time and also supports prefix table computation and cover detection, improving on previous methods by a factor of $z \log z$.

Contribution

It presents an $O(nz)$-time index construction for weighted sequences, where $n$ is the length of the text and $1/z$ is the probability threshold, which speeds up both pattern matching queries and the related computations compared to prior approaches.

Findings

01. Achieves $O(nz)$ construction time for the index

02. Answers pattern matching queries in optimal time

03. Improves performance over previous methods by a factor of $z \log z$

Abstract

The problem of finding factors of a text string which are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in implementing an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alphabet a probability of occurrence of this letter at this position is given. Sequences of this type, also called position weight matrices, are commonly used to represent imprecise or uncertain data. A weighted sequence may represent many different strings, each with probability of occurrence equal to the product of probabilities of its letters at subsequent positions. Given a probability threshold $1/z$, we say that a pattern string $P$ matches a weighted text at position $i$ if the product of…
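To make the notion of a weighted sequence concrete, the following is a minimal sketch of a naive matcher, not the paper's index. It assumes that the truncated definition above continues in the usual way: $P$ matches at position $i$ when the product of the probabilities of its letters at positions $i, \ldots, i+|P|-1$ is at least $1/z$. The names `occurrence_probability` and `naive_matches`, the list-of-dicts representation, and the example text are all hypothetical choices made for illustration.

```python
from fractions import Fraction


def occurrence_probability(weighted_text, pattern, i):
    """Product of the probabilities of pattern's letters at positions i, i+1, ...

    weighted_text is a list with one dict per position, mapping each letter
    to its probability there (an assumed representation; a missing letter
    has probability 0).
    """
    prob = Fraction(1)
    for offset, letter in enumerate(pattern):
        prob *= weighted_text[i + offset].get(letter, Fraction(0))
    return prob


def naive_matches(weighted_text, pattern, z):
    """Return all positions where the pattern occurs with probability >= 1/z.

    Assumption: this is the standard threshold reading of the definition.
    It checks every position directly, so it runs in O(n * |pattern|) time
    per query and only serves as a reference, not as an efficient index.
    """
    threshold = Fraction(1) / Fraction(z)
    last_start = len(weighted_text) - len(pattern)
    return [
        i
        for i in range(last_start + 1)
        if occurrence_probability(weighted_text, pattern, i) >= threshold
    ]


# A hypothetical weighted text of length 3 over the alphabet {a, b}.
text = [
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(1)},
    {"a": Fraction(3, 4), "b": Fraction(1, 4)},
]

matches_aa = naive_matches(text, "aa", 2)
matches_ab = naive_matches(text, "ab", 4)
```

Exact rational arithmetic via `Fraction` is used so that comparisons against the threshold are not affected by floating-point rounding. A brute-force checker of this kind can be useful as a test oracle when implementing an actual index.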
