Empirical distribution of k-word matches in biological sequences
Sylvain Foret, Susan R. Wilson, Conrad J. Burden

TL;DR
This paper provides practical statistical approximations for the distribution of the D_2 statistic, an alignment-free measure for comparing biological sequences based on the number of k-words they share.
Contribution
It offers the first usable approximations of the distribution of D_2 for parameter values typical of biological sequences, complementing earlier results that covered only asymptotic behaviour.
Findings
Provides statistical models for D_2 distribution in biological sequences
Enables more accurate significance testing in sequence comparison
Improves the reliability of clustering in large sequence databases
Abstract
This study focuses on an alignment-free sequence comparison method: the number of words of length k shared between two sequences, also known as the D_2 statistic. The advantages of the use of this statistic over alignment-based methods are firstly that it does not assume that homologous segments are contiguous, and secondly that the algorithm is computationally extremely fast, the runtime being proportional to the size of the sequence under scrutiny. Existing applications of the D_2 statistic include the clustering of related sequences in large EST databases such as the STACK database. Such applications have typically relied on heuristics without any statistical basis. Rigorous statistical characterisations of the distribution of D_2 have subsequently been undertaken, but have focussed on the distribution's asymptotic behaviour, leaving the distribution of D_2 uncharacterised for most…
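The abstract describes D_2 as the number of length-k word matches between two sequences. A minimal sketch of that count follows, assuming the usual formulation in which every pair of positions carrying the same k-word counts as one match, so that D_2 is the sum over words of the product of their counts in the two sequences. The function name `d2_statistic` and its parameters are illustrative and are not taken from the paper.

```python
from collections import Counter


def d2_statistic(seq_a: str, seq_b: str, k: int) -> int:
    """Count k-word matches between two sequences (illustrative sketch).

    Assumes D_2 = sum over words w of count_a(w) * count_b(w), i.e. each
    pair of positions (i, j) with the same k-word contributes one match.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")

    # Count every overlapping k-word in each sequence; one pass per sequence.
    # A sequence shorter than k yields an empty counter.
    counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    counts_b = Counter(seq_b[j:j + k] for j in range(len(seq_b) - k + 1))

    # Iterate over the smaller table; words absent from the other
    # sequence contribute zero (Counter returns 0 for missing keys).
    if len(counts_a) > len(counts_b):
        counts_a, counts_b = counts_b, counts_a
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

For example, `d2_statistic("ACGT", "ACGT", 2)` gives 3 (the words AC, CG and GT each match once), while `d2_statistic("AAAA", "AAA", 2)` gives 6, because the word AA occurs three times in the first sequence and twice in the second. For fixed k, each sequence is scanned once, which is consistent with the linear runtime mentioned in the abstract.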
