Minimizing the average distance to a closest leaf in a phylogenetic tree
Frederick A. Matsen, Aaron Gallagher, Connor McCoy

TL;DR
This paper formalizes the problem of selecting a representative subset of sequences in a phylogenetic tree so as to minimize the average distance to the closest leaf (ADCL), introduces algorithms for it, including an exact dynamic programming method, and compares their effectiveness on simulated and real data.
Contribution
It develops an exact dynamic programming algorithm for ADCL minimization and compares it with heuristic methods, providing insights into their performance and practical utility.
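To make the criterion concrete, here is a minimal sketch under a simplifying assumption: every leaf of the tree carries equal weight, and the ADCL of a chosen subset is the mean, over all leaves, of the tree distance to the nearest chosen leaf. The exhaustive search below is not the paper's dynamic program; it only illustrates what an exact minimizer returns on a tiny input. The function names and the 4-leaf distance matrix are hypothetical.

```python
from itertools import combinations


def adcl(dist, chosen):
    """Mean over all leaves of the distance to the nearest chosen leaf.

    dist is a symmetric matrix of leaf-to-leaf path lengths (assumed input);
    chosen is a collection of leaf indices.
    """
    n = len(dist)
    return sum(min(dist[i][j] for j in chosen) for i in range(n)) / n


def best_subset(dist, k):
    """Exhaustive search over all k-subsets; feasible only for tiny trees."""
    return min(combinations(range(len(dist)), k), key=lambda s: adcl(dist, s))


# Hypothetical tree ((A:1,B:1):2,(C:1,D:1):2) with leaves A, B, C, D.
DIST = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 2],
    [6, 6, 2, 0],
]

if __name__ == "__main__":
    subset = best_subset(DIST, 2)
    print(subset, adcl(DIST, subset))
```

On this toy tree, keeping one leaf from each cherry gives a lower ADCL than keeping both leaves of the same cherry, which is the behavior one would expect from a representative subset.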
Findings
Exact algorithm outperforms heuristics on small trees
PAM heuristic is faster but less accurate for larger trees
ADCL criterion reduces chimeric sequences in real data
Abstract
When performing an analysis on a collection of molecular sequences, it can be convenient to reduce the number of sequences under consideration while maintaining some characteristic of a larger collection of sequences. For example, one may wish to select a subset of high-quality sequences that represent the diversity of a larger collection of sequences. One may also wish to specialize a large database of characterized "reference sequences" to a smaller subset that is as close as possible on average to a collection of "query sequences" of interest. Such a representative subset can be useful whenever one wishes to find a set of reference sequences that is appropriate to use for comparative analysis of environmentally-derived sequences, such as for selecting "reference tree" sequences for phylogenetic placement of metagenomic reads. In this paper we formalize these problems in terms of the…
Taxonomy
Topics: Genomics and Phylogenetic Studies · Genetic diversity and population structure · Bioinformatics and Genomic Networks
