Identifiability and inference of non-parametric rates-across-sites models on large-scale phylogenies
Elchanan Mossel, Sebastien Roch

TL;DR
This paper introduces a new approach for estimating general rates-across-sites models, shows that large phylogenies are typically identifiable under rate variation, and derives sequence-length requirements for high-probability tree reconstruction.
Contribution
The main contribution is a novel algorithm that clusters sites according to their mutation rate. After this clustering step, standard reconstruction techniques can be used to recover the phylogeny.
Findings
Large phylogenies are typically identifiable under rate variation.
A site clustering algorithm groups sites according to their mutation rate.
Sequence-length requirements for high-probability reconstruction are derived.
Abstract
Mutation rate variation across loci is well known to cause difficulties, notably identifiability issues, in the reconstruction of evolutionary trees from molecular sequences. Here we introduce a new approach for estimating general rates-across-sites models. Our results imply, in particular, that large phylogenies are typically identifiable under rate variation. We also derive sequence-length requirements for high-probability reconstruction. Our main contribution is a novel algorithm that clusters sites according to their mutation rate. Following this site clustering step, standard reconstruction techniques can be used to recover the phylogeny. Our results rely on a basic insight: that, for large trees, certain site statistics experience concentration-of-measure phenomena.
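The concentration idea in the abstract can be illustrated with a small simulation. The sketch below is a toy under stated assumptions, not the authors' algorithm: it assumes a two-state symmetric (CFN-style) model, a balanced binary tree with equal branch lengths, exactly two rate classes, and a hypothetical per-site statistic (the fraction of sibling leaf pairs that disagree). All function names and parameter values are invented for illustration. Because the statistic averages over many sibling pairs, its value for each site concentrates near a rate-dependent mean, so a simple one-dimensional 2-means split should separate slow sites from fast ones.

```python
import math
import random


def simulate_site(depth, branch_len, rate, rng):
    """Evolve one binary character down a balanced binary tree.

    Assumed model: each edge flips the state with probability
    (1 - exp(-2 * rate * branch_len)) / 2. Leaves are returned in
    left-to-right order, so leaves 2k and 2k+1 are siblings.
    """
    p_flip = (1.0 - math.exp(-2.0 * rate * branch_len)) / 2.0
    states = [rng.randrange(2)]  # uniform root state
    for _ in range(depth):
        children = []
        for s in states:
            for _child in range(2):
                flipped = rng.random() < p_flip
                children.append(s ^ 1 if flipped else s)
        states = children
    return states


def cherry_disagreement(leaves):
    """Fraction of sibling leaf pairs whose states differ (hypothetical statistic)."""
    pairs = len(leaves) // 2
    return sum(leaves[2 * k] != leaves[2 * k + 1] for k in range(pairs)) / pairs


def two_means_1d(values, iters=50):
    """Plain 1-D 2-means; label 0 is the cluster with the lower center."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        if not low or not high:
            break  # degenerate case: all values fell into one cluster
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return labels


# Illustrative parameters (assumptions, not values from the paper).
rng = random.Random(0)
rates = (0.5, 2.0)                            # slow and fast rate classes
true_classes = [i % 2 for i in range(40)]     # alternate slow / fast sites
stats = [
    cherry_disagreement(simulate_site(11, 0.1, rates[c], rng))
    for c in true_classes
]
labels = two_means_1d(stats)
agreement = sum(a == b for a, b in zip(labels, true_classes)) / len(labels)
print(f"fraction of sites assigned to the correct rate class: {agreement:.2f}")
```

With 2^11 leaves, each site's statistic is an average over 1024 sibling pairs, which is why the two rate classes separate cleanly in this toy setting. On a small tree the same statistic would be much noisier, which is consistent with the abstract's emphasis on large trees.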
