Optimal Computation of Avoided Words
Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, and Dimitris Polychronopoulos

TL;DR
This paper introduces efficient algorithms for identifying avoided words, that is, words whose observed frequency in a sequence falls well below their expected frequency. The work targets DNA analysis and provides both theoretical bounds and a practical implementation.
Contribution
It presents novel linear-time algorithms for computing avoided words of fixed and variable lengths, along with asymptotic bounds and an open-source implementation.
Findings
Algorithms run in linear and near-linear time.
Provides tight bounds on the number and length of avoided words.
Experimental results demonstrate efficiency on real and synthetic data.
Abstract
The deviation of the observed frequency of a word $w$ from its expected frequency in a given sequence $x$ is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the standard deviation of $w$, denoted by $\mathrm{std}(w)$, effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word $w$ of length $k > 2$ is a $\rho$-avoided word in $x$ if $\mathrm{std}(w) \leq \rho$, for a given threshold $\rho < 0$. Notice that such a word may be completely absent from $x$. Hence computing all such words naïvely can be a very time-consuming procedure, in particular for large $k$. In this article, we propose an $O(n)$-time and $O(n)$-space algorithm to compute all $\rho$-avoided words of length $k$ in a given sequence $x$ of length $n$ over a fixed-sized alphabet. We also present a…
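To make the definition concrete, here is a minimal brute-force sketch in Python. It is not the paper's linear-time algorithm. The abstract as shown here does not spell out the formula, so the sketch assumes the formulation commonly used for this measure: the expected frequency of $w$ is $E(w) = f(w_p) f(w_s) / f(w_i)$, where $w_p$, $w_s$ and $w_i$ are the longest proper prefix, the longest proper suffix and the longest infix of $w$, and $\mathrm{std}(w) = (f(w) - E(w)) / \max\{\sqrt{E(w)}, 1\}$. The function names are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def count_factors(x, length):
    """Count occurrences (overlaps allowed) of every factor of x with the given length."""
    return Counter(x[i:i + length] for i in range(len(x) - length + 1))


def std(w, f_k, f_k1, f_k2):
    """Deviation of w, under the assumed formula (f(w) - E(w)) / max(sqrt(E(w)), 1)."""
    prefix, suffix, infix = w[:-1], w[1:], w[1:-1]
    f_infix = f_k2[infix]
    # Assumption: E(w) is taken to be 0 when the infix never occurs.
    expected = f_k1[prefix] * f_k1[suffix] / f_infix if f_infix else 0.0
    return (f_k[w] - expected) / max(sqrt(expected), 1.0)


def avoided_words_naive(x, k, rho):
    """Return {w: std(w)} for every word w of length k with std(w) <= rho."""
    assert k > 2 and rho < 0
    f_k, f_k1, f_k2 = (count_factors(x, m) for m in (k, k - 1, k - 2))
    alphabet = sorted(set(x))
    result = {}
    # If the prefix or suffix of w is absent, E(w) = 0 and std(w) >= 0 > rho,
    # so only extensions of occurring (k-1)-mers need to be examined.
    for prefix in f_k1:
        for letter in alphabet:
            w = prefix + letter
            if f_k1[w[1:]] == 0:
                continue
            deviation = std(w, f_k, f_k1, f_k2)
            if deviation <= rho:
                result[w] = deviation
    return result
```

For example, in the sequence CBCBCBCBACACACAC the word ACB never occurs, yet AC and CB each occur four times and C occurs eight times, so under the assumed formula $E(\mathrm{ACB}) = 2$ and $\mathrm{std}(\mathrm{ACB}) = -\sqrt{2}$. Calling avoided_words_naive with k = 3 and rho = -1.2 reports ACB alone, which illustrates the remark that an avoided word may be completely absent from the sequence.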
Taxonomy
Topics: Algorithms and Data Compression · DNA and Biological Computing · Semigroups and Automata Theory
