Estimating phylogenetic distances between genomic sequences based on the length distribution of k-mismatch common substrings
Burkhard Morgenstern, Svenja Schöbel, Chris-André Leimeister

TL;DR
This paper introduces a method to estimate evolutionary distances between genomic sequences by analyzing the length distribution of k-mismatch common substrings, extending previous exact match approaches.
Contribution
It presents a novel approach to estimate substitutions per site using the length distribution of k-mismatch common substrings, broadening alignment-free sequence comparison methods.
Findings
The position of a local maximum in the length distribution indicates evolutionary distance.
The method extends previous exact match models to inexact k-mismatch matches.
It provides a new tool for phylogenetic analysis based on substring length distributions.
Abstract
Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between two input sequences. Haubold et al. (2009) showed how the average number of substitutions between two DNA sequences can be estimated based on the average length of exact common substrings. In this paper, we study the length distribution of k-mismatch common substrings between two sequences. We show that the number of substitutions per position that have occurred since two sequences evolved from their last common ancestor can be estimated from the position of a local maximum in the length distribution of their k-mismatch common substrings.
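To make the central object concrete, the following is a minimal Python sketch of how a length distribution of k-mismatch common substrings could be tabulated for two short sequences. It is not the authors' implementation. It assumes that a k-mismatch common substring is the longest gap-free extension from a position pair (i, j) containing at most k mismatches, and it uses a naive quadratic scan over all position pairs. The function names `k_mismatch_length` and `length_distribution` are hypothetical, and the step that turns the position of a local maximum into a distance estimate is not shown.

```python
from collections import Counter


def k_mismatch_length(s, t, i, j, k):
    """Length of the longest gap-free extension starting at s[i] and t[j]
    that contains at most k mismatches (assumed definition, see lead-in)."""
    mismatches = 0
    length = 0
    while i + length < len(s) and j + length < len(t):
        if s[i + length] != t[j + length]:
            if mismatches == k:
                break  # a (k+1)-th mismatch would be needed: stop here
            mismatches += 1
        length += 1
    return length


def length_distribution(s, t, k):
    """Count how often each extension length occurs over all position pairs.
    Naive O(len(s) * len(t) * L) scan, only suitable for short sequences."""
    counts = Counter()
    for i in range(len(s)):
        for j in range(len(t)):
            counts[k_mismatch_length(s, t, i, j, k)] += 1
    return counts


if __name__ == "__main__":
    dist = length_distribution("ACGTACGTTA", "ACCTACGATA", k=1)
    for length in sorted(dist):
        print(length, dist[length])
```

For genome-scale input this brute-force scan would be far too slow; it is meant only to show what is being counted.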
Taxonomy
Topics: Genomics and Phylogenetic Studies · Fractal and DNA sequence analysis · Algorithms and Data Compression
