Sequence alignment, mutual information, and dissimilarity measures for constructing phylogenies
Orion Penner, Peter Grassberger, Maya Paczuski

TL;DR
This paper introduces a robust method for estimating mutual information between sequences from global pairwise alignments, and shows that the resulting dissimilarity measures improve phylogenetic accuracy over traditional distances.
Contribution
It presents a simple approach to estimating mutual information from global pairwise alignments computed with standard alignment algorithms, and uses these estimates to improve phylogenetic distance measures.
Findings
Mutual information estimates from global alignments closely match those from alignment-free (compression-based) methods for animal mitochondrial DNA.
Proposed measures outperform traditional distances like Kimura and log-det.
A measure based on single-letter Shannon entropy performs well across animal species.
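To make the single-letter idea concrete, here is a minimal sketch of a plug-in mutual information estimate computed column by column from an already aligned pair of sequences. It is an illustration, not the paper's estimator: the function name `alignment_mutual_information` is hypothetical, gaps (`-`) are treated as an ordinary fifth symbol, and no correction for finite-sample bias or for the information needed to specify the alignment itself is applied.

```python
from collections import Counter
from math import log2


def alignment_mutual_information(a: str, b: str) -> float:
    """Plug-in single-letter mutual information, in bits per aligned column.

    Assumes `a` and `b` are the two rows of a global pairwise alignment,
    so they have equal length. Gap characters count as a regular symbol
    (an assumption made for this sketch only).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    n = len(a)
    if n == 0:
        return 0.0

    joint = Counter(zip(a, b))   # counts of aligned letter pairs (x, y)
    count_a = Counter(a)         # marginal letter counts in the first row
    count_b = Counter(b)         # marginal letter counts in the second row

    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_a[x] * count_b[y]))
    return mi
```

For two identical rows with uniform base composition the estimate equals the single-letter entropy (2 bits per column), and it drops to zero when the aligned letters are statistically independent.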
Abstract
Existing sequence alignment algorithms use heuristic scoring schemes which cannot be used as objective distance metrics. Therefore one relies on measures like the p- or log-det distances, or makes explicit, and often simplistic, assumptions about sequence evolution. Information theory provides an alternative, in the form of mutual information (MI) which is, in principle, an objective and model independent similarity measure. MI can be estimated by concatenating and zipping sequences, yielding thereby the "normalized compression distance". So far this has produced promising results, but with uncontrolled errors. We describe a simple approach to get robust estimates of MI from global pairwise alignments. Using standard alignment algorithms, this gives for animal mitochondrial DNA estimates that are strikingly close to estimates obtained from the alignment-free methods mentioned above. Our…
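The "concatenating and zipping" idea mentioned in the abstract can be sketched as follows. This is a minimal illustration of the normalized compression distance using Python's standard `zlib` module as the compressor; the choice of compressor and the helper names `compressed_size` and `ncd` are assumptions made for this example, and a general-purpose compressor like zlib is only a rough stand-in for the ones used in serious sequence work.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size and xy is the concatenation.
    Values near 0 suggest high shared information; values near 1 suggest little.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

In the same spirit, C(x) + C(y) - C(xy) gives a crude compression-based estimate of the mutual information between the two sequences. Note that zlib only finds repeats within a 32 KB window, so sequences longer than that would need a different compressor for the concatenation trick to work.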
