Optimal Computation of Overabundant Words
Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, Dimitris Polychronopoulos

TL;DR
This paper introduces an optimal linear-time algorithm for identifying overabundant words in sequences, extending previous work on avoided words, with proven combinatorial properties and practical efficiency demonstrated through experiments.
Contribution
It presents the first linear-time, linear-space algorithm for computing overabundant words, based on a novel combinatorial property of suffix trees.
Findings
The algorithm runs in O(n) time and O(n) space for a sequence of length n over an integer alphabet.
Number of overabundant words is bounded by 3n-4.
Experimental results confirm practical efficiency.
Abstract
The observed frequency of the longest proper prefix, the longest proper suffix, and the longest infix of a word w in a given sequence x can be used for classifying w as avoided or overabundant. The definitions used for the expectation and deviation of w in this statistical model were described and biologically justified by Brendel et al. (J Biomol Struct Dyn 1986). We have very recently introduced a time-optimal algorithm for computing all avoided words of a given sequence over an integer alphabet (Algorithms Mol Biol 2017). In this article, we extend this study by presenting an O(n)-time and O(n)-space algorithm for computing all overabundant words in a sequence x of length n over an integer alphabet. Our main result is based on a new non-trivial combinatorial property of the suffix tree T of x: the number of distinct factors of …
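The statistical model in the abstract can be illustrated with a brute-force sketch. It is not the linear-time suffix-tree algorithm of the paper, and the formulas are not given in this summary. They are assumed here in the form usually attributed to Brendel et al.: with f(.) the observed frequency, w_p the longest proper prefix, w_s the longest proper suffix, and w_i the longest infix of w, the expectation is E(w) = f(w_p) f(w_s) / f(w_i) and the deviation is dev(w) = (f(w) - E(w)) / max(sqrt(E(w)), 1). A word is then called overabundant for a threshold rho > 0 if dev(w) >= rho. The function names, the threshold name `rho`, and the restriction to words of length at least 3 are illustrative choices.

```python
from collections import Counter
from math import sqrt


def factor_counts(x):
    """Count overlapping occurrences of every non-empty factor of x (quadratic, for illustration only)."""
    counts = Counter()
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n + 1):
            counts[x[i:j]] += 1
    return counts


def deviation(w, counts):
    """Assumed model: dev(w) = (f(w) - E(w)) / max(sqrt(E(w)), 1)."""
    if len(w) < 3:
        raise ValueError("this sketch only handles words of length >= 3")
    f_prefix = counts[w[:-1]]   # longest proper prefix of w
    f_suffix = counts[w[1:]]    # longest proper suffix of w
    f_infix = counts[w[1:-1]]   # longest infix: w without its first and last letter
    expected = f_prefix * f_suffix / f_infix if f_infix > 0 else 0.0
    return (counts[w] - expected) / max(sqrt(expected), 1.0)


def overabundant_words(x, rho):
    """Return the factors of x (length >= 3) whose deviation is at least rho."""
    counts = factor_counts(x)
    return sorted(w for w in counts
                  if len(w) >= 3 and deviation(w, counts) >= rho)


if __name__ == "__main__":
    x = "abcabcxbxb"
    print(deviation("abc", factor_counts(x)))
    print(overabundant_words(x, 1.0))
```

In the sequence "abcabcxbxb", the word "abc" occurs twice, as do "ab" and "bc", while the infix "b" occurs four times. The expectation under the assumed model is therefore 2 * 2 / 4 = 1, so "abc" is observed more often than expected. This sketch enumerates all factors and so takes at least quadratic time. The paper's contribution is doing the same classification in O(n) time and space.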
Taxonomy
Topics: Algorithms and Data Compression · RNA and protein synthesis mechanisms · Machine Learning in Bioinformatics
