Optimal Computation of Overabundant Words

Yannis Almirantis; Panagiotis Charalampopoulos; Jia Gao; Costas S. Iliopoulos; Manal Mohamed; Solon P. Pissis; Dimitris Polychronopoulos

arXiv:1705.03385·cs.DS·May 10, 2017·2 cites

Open Access

TL;DR

This paper introduces an optimal linear-time algorithm for identifying overabundant words in sequences, extending previous work on avoided words, with proven combinatorial properties and practical efficiency demonstrated through experiments.

Contribution

It presents the first linear-time, linear-space algorithm for computing overabundant words, based on a novel combinatorial property of suffix trees.

Findings

01. The algorithm runs in $O(n)$ time and $O(n)$ space.

02. The number of overabundant words is bounded by $3n-4$.

03. Experimental results confirm practical efficiency.

Abstract

The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word $w$ in a given sequence $x$ can be used for classifying $w$ as avoided or overabundant. The definitions used for the expectation and deviation of $w$ in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017). In this article, we extend this study by presenting an $O(n)$-time and $O(n)$-space algorithm for computing all overabundant words in a sequence $x$ of length $n$ over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree $T$ of $x$: the number of distinct factors of $x$ …
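
To make the statistical model concrete, here is a minimal brute-force sketch in Python. It is not the paper's linear-time suffix-tree algorithm. The abstract does not spell out the formulas, so the sketch assumes the usual form of the Brendel et al. model: the expectation of $w$ is $f(w_p) \cdot f(w_s) / f(w_i)$, where $f$ counts (possibly overlapping) occurrences and $w_p$, $w_s$, $w_i$ are the longest proper prefix, suffix, and infix of $w$; the deviation is $(f(w) - E(w)) / \max(\sqrt{E(w)}, 1)$; and $w$ is treated as overabundant when its deviation is at least a threshold `rho > 0`. The function names and the restriction to words of length at least 3 are illustrative choices, not taken from the paper.

```python
from math import sqrt


def occurrences(x: str, w: str) -> int:
    """Count occurrences of w in x, including overlapping ones."""
    return sum(1 for i in range(len(x) - len(w) + 1) if x.startswith(w, i))


def deviation(x: str, w: str) -> float:
    """Deviation of w in x under the assumed Brendel-style model."""
    if len(w) < 3:
        raise ValueError("w needs length >= 3 so that its infix is non-empty")
    f_w = occurrences(x, w)
    f_p = occurrences(x, w[:-1])   # longest proper prefix
    f_s = occurrences(x, w[1:])    # longest proper suffix
    f_i = occurrences(x, w[1:-1])  # longest infix
    # If the infix never occurs, w cannot occur either; use expectation 0.
    expected = f_p * f_s / f_i if f_i else 0.0
    return (f_w - expected) / max(sqrt(expected), 1.0)


def overabundant_words(x: str, rho: float) -> list[str]:
    """Naively test every distinct factor of x of length >= 3."""
    n = len(x)
    factors = {x[i:j] for i in range(n) for j in range(i + 3, n + 1)}
    return sorted(w for w in factors if deviation(x, w) >= rho)
```

For example, in `x = "abcabcbb"` the word `"abc"` occurs twice, its prefix `"ab"` and suffix `"bc"` occur twice each, and its infix `"b"` occurs four times, so the assumed expectation is 1 and the deviation is 1. The sketch enumerates all factors explicitly and therefore takes far more than linear time; it is only meant to illustrate what is being computed.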

Taxonomy

Topics: Algorithms and Data Compression · RNA and protein synthesis mechanisms · Machine Learning in Bioinformatics