Clustering genes of common evolutionary history
Kevin Gori, Tomasz Suchan, Nadir Alvarez, Nick Goldman, Christophe Dessimoz

TL;DR
This paper evaluates clustering methods for genes with shared evolutionary history, introducing statistical tests for optimal cluster number, and demonstrates improved phylogenetic analysis accuracy.
Contribution
It systematically compares clustering methods for phylogenetic loci and introduces new statistical tests for determining the optimal number of clusters.
Findings
Branch length-aware distances perform best
Spectral clustering and Ward's method are most effective
New statistical tests outperform silhouette criterion
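To illustrate what "branch length-aware" means in the first finding, here is a minimal sketch contrasting a topology-only distance (Robinson-Foulds) with a branch length-aware one (the branch score, or Euclidean, distance). The tree representation, the helper names, and the toy values are assumptions made for illustration; they are not taken from the paper, and the paper's actual set of distances may differ.

```python
from math import sqrt

# Assumed representation: each tree is a dict mapping an internal split
# (the frozenset of taxa on one canonical side of the branch, here the side
# that does not contain taxon "E") to that branch's length.
# Terminal branches are omitted for brevity. All values are hypothetical.
def split(*taxa):
    return frozenset(taxa)

tree1 = {split("A", "B"): 0.10, split("A", "B", "C"): 0.20}
tree2 = {split("A", "B"): 0.12, split("A", "B", "C"): 0.18}  # same topology as tree1
tree3 = {split("A", "C"): 0.10, split("A", "C", "D"): 0.20}  # different topology

def rf_distance(t1, t2):
    """Robinson-Foulds distance: number of splits found in exactly one tree."""
    return len(set(t1) ^ set(t2))

def branch_score_distance(t1, t2):
    """Euclidean distance over branch lengths; a missing split counts as length 0."""
    splits = set(t1) | set(t2)
    return sqrt(sum((t1.get(s, 0.0) - t2.get(s, 0.0)) ** 2 for s in splits))

for name, other in (("tree2", tree2), ("tree3", tree3)):
    print(name, rf_distance(tree1, other), round(branch_score_distance(tree1, other), 4))
```

In this toy example the topology-only distance cannot tell tree1 from tree2, while the branch length-aware distance still registers their small difference in branch lengths.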
Abstract
Phylogenetic inference can potentially result in a more accurate tree using data from multiple loci. However, if the loci are incongruent--due to events such as incomplete lineage sorting or horizontal gene transfer--it can be misleading to infer a single tree. To address this, many previous contributions have taken a mechanistic approach, by modelling specific processes. Alternatively, one can cluster loci without assuming how these incongruencies might arise. Such "process-agnostic" approaches typically infer a tree for each locus and cluster these. There are, however, many possible combinations of tree distance and clustering methods; their comparative performance in the context of tree incongruence is largely unknown. Furthermore, because standard model selection criteria such as AIC cannot be applied to problems with a variable number of topologies, the issue of inferring the…
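The process-agnostic approach described in the abstract (one tree per locus, then clustering of those trees) can be sketched as follows, assuming the pairwise tree distances are already available. The sketch uses Ward's method, implemented through the standard Lance-Williams update, and the silhouette criterion as a baseline for choosing the number of clusters. The distance matrix and the function names are hypothetical, and the paper's own statistical tests are not reproduced here.

```python
def ward_clusters(dist, k):
    """Agglomerative clustering into k groups using Ward's criterion.

    dist is a symmetric matrix of pairwise distances between locus trees.
    Squared distances are merged with the Lance-Williams update for Ward.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(members) > k:
        (a, b), dab = min(d2.items(), key=lambda kv: kv[1])  # closest pair
        na, nb = len(members[a]), len(members[b])
        updated = {}
        for c in members:
            if c in (a, b):
                continue
            nc = len(members[c])
            dac = d2[min(a, c), max(a, c)]
            dbc = d2[min(b, c), max(b, c)]
            updated[c] = ((na + nc) * dac + (nb + nc) * dbc - nc * dab) / (na + nb + nc)
        d2 = {p: v for p, v in d2.items() if a not in p and b not in p}
        members[next_id] = members.pop(a) + members.pop(b)
        for c, v in updated.items():
            d2[(c, next_id)] = v  # next_id is larger than every existing id
        next_id += 1
    labels = [0] * n
    for label, group in enumerate(members.values()):
        for i in group:
            labels[i] = label
    return labels

def silhouette(dist, labels):
    """Mean silhouette width; requires at least two clusters."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist[i][j] for j in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(labels)

# Hypothetical distances between four locus trees forming two groups.
dist = [
    [0.00, 0.03, 0.30, 0.32],
    [0.03, 0.00, 0.31, 0.29],
    [0.30, 0.31, 0.00, 0.02],
    [0.32, 0.29, 0.02, 0.00],
]
for k in (2, 3):
    labels = ward_clusters(dist, k)
    print(k, labels, round(silhouette(dist, labels), 3))
```

On this toy matrix the silhouette is higher for two clusters than for three, which matches how the data were constructed. The findings above report that the paper's new tests outperform this criterion.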
