Estimating phylogenetic distances between genomic sequences based on the length distribution of k-mismatch common substrings

Burkhard Morgenstern; Svenja Schöbel; Chris-André Leimeister

arXiv:1709.01371·q-bio.PE·September 6, 2017

Open Access

TL;DR

This paper introduces a method to estimate evolutionary distances between genomic sequences from the length distribution of their k-mismatch common substrings, extending previous approaches based on exact matches.

Contribution

It presents a novel approach to estimate substitutions per site using the length distribution of k-mismatch common substrings, broadening alignment-free sequence comparison methods.

Findings

1. The position of a local maximum in the length distribution of k-mismatch common substrings can be used to estimate the number of substitutions per position between two sequences.

2. The method extends previous models based on exact matches to inexact matches with k mismatches.

3. It provides a new tool for phylogenetic analysis based on substring length distributions.

Abstract

Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between two input sequences. Haubold et al. (2009) showed how the average number of substitutions between two DNA sequences can be estimated based on the average length of exact common substrings. In this paper, we study the length distribution of $k$-mismatch common substrings between two sequences. We show that the number of substitutions per position that have occurred since two sequences evolved from their last common ancestor can be estimated from the position of a local maximum in the length distribution of their $k$-mismatch common substrings.
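
To make the quantity in the abstract concrete, the following is a minimal Python sketch of how such a length distribution could be tabulated. It rests on assumptions that the abstract does not spell out: a "k-mismatch common substring" is taken here to be the longest gap-free extension from a position pair (i, j) that contains at most k mismatches, and every position pair is counted. The function names `k_mismatch_length`, `length_distribution`, and `local_maxima` are hypothetical, and the naive scan over all pairs is only suitable for short toy sequences. The step that converts the position of a maximum into substitutions per position is not reproduced here, since the abstract does not state it.

```python
from collections import Counter


def k_mismatch_length(s, t, i, j, k):
    """Length of the longest gap-free extension from s[i], t[j]
    that contains at most k mismatches (assumed definition)."""
    mismatches = 0
    length = 0
    while i + length < len(s) and j + length < len(t):
        if s[i + length] != t[j + length]:
            mismatches += 1
            if mismatches > k:
                break  # the (k+1)-th mismatch ends the extension
        length += 1
    return length


def length_distribution(s, t, k):
    """Histogram of k-mismatch extension lengths over all position pairs.
    Naive scan over every (i, j); illustrative only."""
    counts = Counter()
    for i in range(len(s)):
        for j in range(len(t)):
            counts[k_mismatch_length(s, t, i, j, k)] += 1
    return counts


def local_maxima(counts):
    """Lengths m >= 1 whose count strictly exceeds both neighbours.
    Lengths absent from the histogram count as 0."""
    if not counts:
        return []
    return [
        m
        for m in range(1, max(counts) + 1)
        if counts[m] > counts[m - 1] and counts[m] > counts[m + 1]
    ]


if __name__ == "__main__":
    dist = length_distribution("ACGTACGTTGCA", "ACGAACGTTGGA", k=1)
    for length in sorted(dist):
        print(length, dist[length])
    print("local maxima at lengths:", local_maxima(dist))
```

On real genomic data the raw histogram is likely to be noisy, so some smoothing before searching for a maximum would probably be needed; that choice is left open in this sketch.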

Taxonomy

Topics: Genomics and Phylogenetic Studies · Fractal and DNA sequence analysis · Algorithms and Data Compression