Optimal Computation of Avoided Words

Yannis Almirantis, Panagiotis Charalampopoulos, Jia Gao, Costas S. Iliopoulos, Manal Mohamed, Solon P. Pissis, and Dimitris Polychronopoulos

arXiv:1604.08760·cs.DS·May 2, 2016

TL;DR

This paper introduces efficient algorithms for identifying avoided words, that is, words whose observed frequency in a sequence falls sufficiently far below their expected frequency. It targets applications in DNA analysis and provides both theoretical bounds and a practical implementation.

Contribution

The paper presents novel algorithms for computing avoided words of fixed length (in $O(n)$ time and $O(n)$ space for a sequence of length $n$ over a fixed-sized alphabet) and of variable length, along with asymptotic bounds and an open-source implementation.

Findings

01. Algorithms run in linear and near-linear time.

02. Provides tight bounds on the number and length of avoided words.

03. Experimental results demonstrate efficiency on real and synthetic data.

Abstract

The deviation of the observed frequency of a word $w$ from its expected frequency in a given sequence $x$ is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the standard deviation of $w$, denoted by $std(w)$, effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word $w$ of length $k > 2$ is a $\rho$-avoided word in $x$ if $std(w) \leq \rho$, for a given threshold $\rho < 0$. Notice that such a word may be completely absent from $x$. Hence computing all such words naïvely can be a very time-consuming procedure, in particular for large $k$. In this article, we propose an $O(n)$-time and $O(n)$-space algorithm to compute all $\rho$-avoided words of length $k$ in a given sequence $x$ of length $n$ over a fixed-sized alphabet. We also present a…
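To make the definition concrete, here is a minimal brute-force sketch of the naïve approach the abstract contrasts against. It is not the paper's $O(n)$ algorithm. The excerpt above does not spell out how $std(w)$ is computed, so the sketch assumes the formulation commonly used for this measure: the expected frequency is $E(w) = f(w_p) \cdot f(w_s) / f(w_i)$, where $w_p$, $w_s$ and $w_i$ are the longest proper prefix, suffix and infix of $w$ and $f$ counts occurrences in $x$, and $std(w) = (f(w) - E(w)) / \max\{\sqrt{E(w)}, 1\}$. The function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def count_substrings(x, m):
    """Count occurrences (overlaps included) of every length-m substring of x."""
    return Counter(x[i:i + m] for i in range(len(x) - m + 1))


def std(w, f_k, f_k1, f_k2):
    """Deviation of w under the ASSUMED formula described in the lead-in."""
    prefix, suffix, infix = w[:-1], w[1:], w[1:-1]
    f_infix = f_k2[infix]  # Counter returns 0 for unseen keys
    expected = f_k1[prefix] * f_k1[suffix] / f_infix if f_infix > 0 else 0.0
    return (f_k[w] - expected) / max(sqrt(expected), 1.0)


def avoided_words_naive(x, k, rho):
    """Brute-force list of rho-avoided words of length k > 2 in x."""
    if k <= 2 or rho >= 0:
        raise ValueError("requires k > 2 and rho < 0")
    f_k, f_k1, f_k2 = (count_substrings(x, m) for m in (k, k - 1, k - 2))
    alphabet = sorted(set(x))
    result = []
    # A word whose prefix never occurs has E(w) = 0 and hence std(w) >= 0,
    # so only extensions of occurring (k-1)-length substrings can be avoided.
    for prefix in f_k1:
        for c in alphabet:
            w = prefix + c
            if std(w, f_k, f_k1, f_k2) <= rho:
                result.append(w)
    return sorted(result)


if __name__ == "__main__":
    x = "ACACACAGAGAG"
    print(avoided_words_naive(x, 3, -1.0))  # ['GAC']
    print(avoided_words_naive(x, 3, -0.4))  # ['CAG', 'GAC']
```

In this toy sequence, `GAC` never occurs even though `GA` and `AC` both do, which matches the abstract's remark that an avoided word may be completely absent from $x$. The sketch examines up to $|\Sigma|$ extensions of every distinct substring of length $k-1$ and stores substrings explicitly, which is the kind of cost the paper's algorithm is designed to avoid.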

Code & Models

Repositories

solonas13/aw (Official)

Taxonomy

Topics: Algorithms and Data Compression · DNA and Biological Computing · Semigroups and Automata Theory