Approximate word matches between two random sequences

Conrad J. Burden; Miriam R. Kantorovitz; Susan R. Wilson

arXiv:0801.3145·math.PR·September 29, 2009


TL;DR

This paper generalizes the $D_2$ statistic for DNA sequences to count $m$-letter word matches with up to $k$ mismatches. It computes the expectation of the generalized statistic, bounds its variance, and proves that its distribution is asymptotically normal.

Contribution

It introduces a generalized $D_2$ statistic that counts $m$-letter word matches with up to $k$ mismatches ($k < m$) and establishes its asymptotic normality under the assumption of strand symmetric Bernoulli text.

Findings

01 Expectation of the statistic computed

02 Upper and lower variance bounds established

03 Asymptotic normality proved

Abstract

Given two sequences over a finite alphabet $L$, the $D_{2}$ statistic is the number of $m$-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the $D_{2}$ statistic in the context of DNA sequences, under the assumption of strand symmetric Bernoulli text. For $k < m$, we look at the count of $m$-letter word matches with up to $k$ mismatches. For this statistic, we compute the expectation, give upper and lower bounds for the variance and prove its distribution is asymptotically normal.
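
To make the definition in the abstract concrete, the following is a minimal Python sketch of the generalized count, not the authors' implementation. It assumes that every pair of starting positions $(i, j)$ is compared, that "up to $k$ mismatches" means Hamming distance at most $k$ between the two $m$-letter words, and that sequences are not wrapped periodically. The names `hamming`, `d2_approx`, and `strand_symmetric_text`, and the parameter `p_at`, are hypothetical. The text generator takes "strand symmetric" to mean that A and T are equally likely and C and G are equally likely, with letters drawn independently.

```python
import random


def hamming(u, v):
    """Number of positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))


def d2_approx(seq_a, seq_b, m, k):
    """Count pairs (i, j) whose m-letter words differ in at most k positions.

    With k = 0 this reduces to the count of exact m-letter word matches.
    Assumption: no periodic wrap-around at the ends of either sequence.
    """
    count = 0
    for i in range(len(seq_a) - m + 1):
        word_a = seq_a[i:i + m]
        for j in range(len(seq_b) - m + 1):
            if hamming(word_a, seq_b[j:j + m]) <= k:
                count += 1
    return count


def strand_symmetric_text(n, p_at, rng):
    """Draw n independent letters with P(A) = P(T) and P(C) = P(G).

    p_at is the total probability of A or T (hypothetical parameter name).
    """
    weights = [p_at / 2, (1 - p_at) / 2, (1 - p_at) / 2, p_at / 2]
    return "".join(rng.choices("ACGT", weights=weights, k=n))


if __name__ == "__main__":
    rng = random.Random(0)
    a = strand_symmetric_text(200, 0.6, rng)
    b = strand_symmetric_text(200, 0.6, rng)
    print(d2_approx(a, b, m=5, k=1))
```

The double loop costs on the order of $n^2 m$ letter comparisons for two sequences of length $n$, which is fine for illustrating the statistic on short sequences but is not meant as an efficient method.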
