Approximate word matches between two random sequences
Conrad J. Burden, Miriam R. Kantorovitz, Susan R. Wilson

TL;DR
This paper generalizes the $D_2$ statistic for DNA sequences to count word matches that allow a bounded number of mismatches, and proves that the generalized statistic is asymptotically normal.
Contribution
It introduces a generalized $D_2$ statistic for approximate matches with mismatches and establishes its asymptotic normality under strand symmetric Bernoulli models.
Findings
Expectation of the statistic computed
Variance bounds established
Asymptotic normality proved
Abstract
Given two sequences over a finite alphabet, the $D_2$ statistic is the number of $m$-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the $D_2$ statistic in the context of DNA sequences, under the assumption of strand symmetric Bernoulli text. For $k < m$, we look at the count of $m$-letter word matches with up to $k$ mismatches. For this statistic, we compute the expectation, give upper and lower bounds for the variance and prove its distribution is asymptotically normal.
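The count described in the abstract can be illustrated with a brute-force sketch. It assumes the statistic ranges over all pairs of start positions, one in each sequence, and that two $m$-letter words match when they differ in at most $k$ positions. The function name `approximate_word_matches` and its parameters are hypothetical and not taken from the paper, which may treat sequence ends differently.

```python
def approximate_word_matches(seq_a, seq_b, m, k):
    """Count position pairs (i, j) whose m-letter words differ in at most k places.

    Illustrative only: a direct O(len(seq_a) * len(seq_b) * m) enumeration,
    assuming non-periodic sequences (words do not wrap around the ends).
    """
    if m <= 0:
        raise ValueError("word length m must be positive")
    if k < 0:
        raise ValueError("mismatch allowance k must be non-negative")

    count = 0
    for i in range(len(seq_a) - m + 1):
        word_a = seq_a[i:i + m]
        for j in range(len(seq_b) - m + 1):
            word_b = seq_b[j:j + m]
            # Hamming distance between the two m-letter words
            mismatches = sum(x != y for x, y in zip(word_a, word_b))
            if mismatches <= k:
                count += 1
    return count


exact = approximate_word_matches("ACGT", "ACGA", m=2, k=0)   # 2
approx = approximate_word_matches("ACGT", "ACGA", m=2, k=1)  # 3
```

With $k = 0$ this reduces to counting exact $m$-letter word matches; allowing one mismatch additionally counts the pair `GT` / `GA`.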
