Evolutionary distances in the twilight zone -- a rational kernel approach
Roland F. Schwarz, William Fletcher, Frank Förster, Benjamin Merget, Matthias Wolf, Jörg Schultz, Florian Markowetz

TL;DR
This paper introduces a biologically motivated, alignment-free evolutionary distance metric using finite-state transducers, improving phylogenetic reconstruction accuracy for highly divergent sequences.
Contribution
It presents a novel finite-state transducer-based similarity score that models substitutions and indels without requiring multiple sequence alignments.
Findings
More accurate phylogenetic reconstructions in simulations
Effective on real-world divergent sequences
Suitable for large datasets
Abstract
Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and…
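To make the idea of an alignment-free score that still models substitutions and indels concrete, here is a minimal sketch. It is not the authors' transducer: it assumes a simplified single-state weighted edit transducer and sums the weights of all edit paths between two sequences by dynamic programming, so no single alignment is ever chosen. The function names (path_sum_score, log_dissimilarity) and the parameters (match, gap, alphabet_size) are hypothetical and chosen only for illustration.

```python
import math


def path_sum_score(x, y, match=0.9, gap=0.05, alphabet_size=4):
    """Sum the weights of all edit paths that rewrite x into y.

    Assumption: a toy single-state transducer with one weight for a
    matching symbol, a uniform weight for any mismatch, and one weight
    for each inserted or deleted symbol.
    """
    mismatch = (1.0 - match) / (alphabet_size - 1)
    # f[i][j] = total weight of all paths that consume x[:i] and emit y[:j]
    f = [[0.0] * (len(y) + 1) for _ in range(len(x) + 1)]
    f[0][0] = 1.0
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i > 0:
                f[i][j] += gap * f[i - 1][j]  # delete x[i-1]
            if j > 0:
                f[i][j] += gap * f[i][j - 1]  # insert y[j-1]
            if i > 0 and j > 0:
                sub = match if x[i - 1] == y[j - 1] else mismatch
                f[i][j] += sub * f[i - 1][j - 1]  # substitute or match
    return f[len(x)][len(y)]


def log_dissimilarity(x, y, **params):
    """Negative log of the path-sum score, normalised by the self-scores."""
    kxy = path_sum_score(x, y, **params)
    kxx = path_sum_score(x, x, **params)
    kyy = path_sum_score(y, y, **params)
    return -math.log(kxy / math.sqrt(kxx * kyy))


if __name__ == "__main__":
    for a, b in [("ACGT", "ACGT"), ("ACGT", "ACGGT"), ("ACGT", "TTTT")]:
        print(a, b, round(log_dissimilarity(a, b), 3))
```

Because every edit path contributes to the score, substitutions and indels are both taken into account without building a multiple alignment. This toy score is not guaranteed to be a valid kernel or a metric, and for long sequences the sums would need log-space arithmetic to avoid underflow.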
