Empirical distribution of k-word matches in biological sequences

Sylvain Foret; Susan R. Wilson; Conrad J. Burden

arXiv:0803.2085·q-bio.QM·September 8, 2009


TL;DR

This paper provides practical statistical approximations for the distribution of the D_2 statistic, an alignment-free measure that compares two biological sequences by counting the k-words they share, so that observed D_2 values can be assessed for significance.

Contribution

It offers the first usable approximations of D_2's distribution for common biological sequence parameters, bridging theoretical and practical needs.

Findings

01 Provides statistical models for D_2 distribution in biological sequences

02 Enables more accurate significance testing in sequence comparison

03 Improves the reliability of clustering in large sequence databases

Abstract

This study focuses on an alignment-free sequence comparison method: the number of words of length k shared between two sequences, also known as the D_2 statistic. The advantages of the use of this statistic over alignment-based methods are firstly that it does not assume that homologous segments are contiguous, and secondly that the algorithm is computationally extremely fast, the runtime being proportional to the size of the sequence under scrutiny. Existing applications of the D_2 statistic include the clustering of related sequences in large EST databases such as the STACK database. Such applications have typically relied on heuristics without any statistical basis. Rigorous statistical characterisations of the distribution of D_2 have subsequently been undertaken, but have focussed on the distribution's asymptotic behaviour, leaving the distribution of D_2 uncharacterised for most…
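
As an illustration of the statistic described in the abstract, here is a minimal Python sketch of a D_2 computation. It assumes the usual definition in which D_2 counts every pair of positions (one in each sequence) whose k-words match, so that repeated words contribute the product of their counts. The function names `kword_counts` and `d2` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def kword_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of k-word matches between seq_a and seq_b.

    Assumption: a match is any pair of positions (i, j) with
    seq_a[i:i+k] == seq_b[j:j+k], so a word seen m times in seq_a
    and n times in seq_b contributes m * n matches.
    """
    counts_a = kword_counts(seq_a, k)
    counts_b = kword_counts(seq_b, k)
    # Counter returns 0 for words absent from seq_b.
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

Each sequence is scanned once to build its word counts, which is in line with the abstract's remark that the runtime is proportional to sequence size.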
