Poisson approximation for search of rare words in DNA sequences
Nicolas Vergne (1), Miguel Abadi (2) ((1) Laboratoire Statistique et Génome, France, (2) Universidade de Campinas, Brazil)

TL;DR
This paper introduces a psi-mixing method for approximating the distribution of the number of occurrences of rare words in DNA sequences modeled by Markov chains. Its local error bounds are tighter in the tails of the distribution than the global bound given by the Chen-Stein method.
Contribution
The paper develops a local error bound approach using psi-mixing, providing more precise thresholds for over- or under-represented words in biological sequences compared to Chen-Stein methods.
Findings
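The psi-mixing method gives local error bounds for the Poisson approximation that are tighter in the tails than the global Chen-Stein bound.
Two thresholds on the number of occurrences mark a word as over-represented or under-represented.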
Software PANOW implements the method.
Abstract
Using recent results on the occurrence times of a string of symbols in a stochastic process with mixing properties, we present a new method for the search of rare words in biological sequences generally modelled by a Markov chain. We obtain a bound on the error between the distribution of the number of occurrences of a word in a sequence (under a Markov model) and its Poisson approximation. A global bound is already given by the Chen-Stein method. Our approach, the psi-mixing method, gives local bounds. Since we only need the error in the tails of the distribution, the global uniform bound of Chen-Stein is too large and local bounds are better suited. We search for two thresholds on the number of occurrences beyond which we can regard the studied word as over-represented or under-represented. A biological role is suggested for these over- or under-represented words. Our…
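The thresholding idea in the abstract can be illustrated with a minimal sketch. It is not the PANOW software and does not implement the psi-mixing error bound: it only fits a first-order Markov model to a sequence, takes the expected word count as the Poisson mean, and reads two thresholds off the plain Poisson tails. The function names (`expected_count`, `poisson_cdf`, `thresholds`) and the significance level `alpha` are assumptions made for illustration. In the method described above, the tail probabilities would additionally be corrected by the local error bound.

```python
import math
from collections import Counter


def expected_count(seq, word):
    """Expected number of occurrences of `word` in a sequence of len(seq)
    letters, under a first-order Markov model fitted to `seq` (assumption:
    the stationary law is approximated by the letter frequencies)."""
    singles = Counter(seq)
    pairs = Counter(zip(seq, seq[1:]))
    prob = singles[word[0]] / len(seq)
    for a, b in zip(word, word[1:]):
        # Transition probability a -> b, estimated from adjacent-pair counts.
        starts = sum(c for (x, _), c in pairs.items() if x == a)
        if starts == 0:
            return 0.0
        prob *= pairs[(a, b)] / starts
    # Number of positions where the word can start, times its probability.
    return (len(seq) - len(word) + 1) * prob


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam); returns 0.0 when k < 0."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)


def thresholds(lam, alpha=0.01):
    """Return (low, high) such that P(X <= low) <= alpha and
    P(X >= high) <= alpha under Poisson(lam); low is -1 if no count
    is small enough to qualify as under-represented."""
    low = -1
    while poisson_cdf(low + 1, lam) <= alpha:
        low += 1
    high = 0
    while 1.0 - poisson_cdf(high - 1, lam) > alpha:
        high += 1
    return low, high
```

With these definitions, a word observed at most `low` times would be flagged as under-represented and a word observed at least `high` times as over-represented, at level `alpha` and ignoring the approximation error that the paper sets out to bound.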
Taxonomy
Topics: RNA and protein synthesis mechanisms · Genomics and Phylogenetic Studies · DNA and Biological Computing
