Phase transition in the sample complexity of likelihood-based phylogeny inference
Sebastien Roch, Allan Sly

TL;DR
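This paper establishes tight bounds on the sequence length needed for maximum likelihood phylogeny inference: logarithmic in the number of tips below the Kesten-Stigum threshold, and polynomial in general. It also shows that maximum likelihood can be computed efficiently on random data under certain conditions.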
Contribution
It provides the first matching upper and lower bounds on sequence-length requirements for maximum likelihood phylogeny reconstruction, especially near the Kesten-Stigum threshold.
Findings
Below the Kesten-Stigum threshold, the sequence-length requirement is logarithmic in the number of tips.
Sequence-length requirement is polynomial in the number of tips in general.
Maximum likelihood can be computed efficiently on random data under certain conditions.
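To make the Kesten-Stigum threshold in the findings concrete, here is a minimal sketch. It assumes an r-state symmetric substitution model in which every off-diagonal rate equals alpha; this normalization is chosen for illustration and may differ from the paper's own parametrization. It also assumes a tree with a fixed branching number (2 for a binary phylogeny). Under those assumptions the per-edge correlation is theta = exp(-r * alpha * t) for an edge of length t, and the threshold is the edge length at which branching * theta^2 = 1. All function names are hypothetical.

```python
import math


def edge_correlation(t, r=2, alpha=1.0):
    """Second eigenvalue of exp(t * Q) for an r-state symmetric rate matrix Q.

    Assumption: every off-diagonal entry of Q equals alpha, so the nonzero
    eigenvalue of Q is -r * alpha.
    """
    return math.exp(-r * alpha * t)


def ks_critical_length(r=2, alpha=1.0, branching=2):
    """Edge length at which branching * theta**2 == 1 (Kesten-Stigum threshold)."""
    return math.log(branching) / (2 * r * alpha)


def below_ks_threshold(t, r=2, alpha=1.0, branching=2):
    """True when an edge of length t is shorter than the critical length."""
    return branching * edge_correlation(t, r, alpha) ** 2 > 1


if __name__ == "__main__":
    print("critical edge length (r=2, alpha=1):", ks_critical_length())
    for t in (0.1, 0.3):
        print(t, "below threshold:", below_ks_threshold(t))
```

In this sketch, edges shorter than the critical length correspond to the regime the findings describe as logarithmic; the sketch only locates the threshold and does not estimate sequence-length requirements.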
Abstract
Reconstructing evolutionary trees from molecular sequence data is a fundamental problem in computational biology. Stochastic models of sequence evolution are closely related to spin systems that have been extensively studied in statistical physics, and that connection has led to important insights on the theoretical properties of phylogenetic reconstruction algorithms as well as the development of new inference methods. Here, we study maximum likelihood, a classical statistical technique which is perhaps the most widely used in phylogenetic practice because of its superior empirical accuracy. At the theoretical level, except for its consistency, that is, the guarantee of eventual correct reconstruction as the size of the input data grows, much remains to be understood about the statistical properties of maximum likelihood in this context. In particular, the best bounds on the sample…
