Principal components analysis in the space of phylogenetic trees
Tom M. W. Nye

TL;DR
This paper introduces a novel geometrical method for applying Principal Components Analysis to collections of phylogenetic trees, enabling the analysis of variation in tree topology and branch lengths.
Contribution
It develops an approach to PCA in tree-space based on geodesic paths, addressing the difficulty that the space of phylogenetic trees on a fixed set of taxa is not a vector space.
Findings
Identifies principal paths that explain the main variation in a set of trees
Demonstrates the method on simulated and real gene trees
Reveals sources of variation in topology and branch lengths
Abstract
Phylogenetic analysis of DNA or other data commonly gives rise to a collection or sample of inferred evolutionary trees. Principal Components Analysis (PCA) cannot be applied directly to collections of trees since the space of evolutionary trees on a fixed set of taxa is not a vector space. This paper describes a novel geometrical approach to PCA in tree-space that constructs the first principal path in an analogous way to standard linear Euclidean PCA. Given a data set of phylogenetic trees, a geodesic principal path is sought that maximizes the variance of the data under a form of projection onto the path. Due to the high dimensionality of tree-space and the nonlinear nature of this problem, the computational complexity is potentially very high, so approximate optimization algorithms are used to search for the optimal path. Principal paths identified in this way reveal and quantify…
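For orientation, the standard linear Euclidean case that the abstract uses as its analogy can be sketched in a few lines. This is only the vector-space baseline, not the tree-space algorithm from the paper: it finds the line through the data mean that maximizes the variance of the orthogonal projections. The function names `first_principal_line` and `projected_variance` are hypothetical, chosen for this illustration.

```python
import numpy as np


def first_principal_line(points):
    """Euclidean baseline: line through the mean maximizing projected variance.

    Returns (mean, unit direction, variance of the projections onto the line).
    """
    X = np.asarray(points, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    # The first right singular vector of the centered data is the
    # direction of maximal projected variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    scores = centered @ direction  # signed positions along the line
    return mean, direction, scores.var()


def projected_variance(points, direction):
    """Variance of the data projected onto an arbitrary direction."""
    X = np.asarray(points, dtype=float)
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    return ((X - X.mean(axis=0)) @ d).var()
```

In tree-space the straight line is replaced by a geodesic path and the closed-form SVD step is no longer available, which is why the abstract describes a search with approximate optimization algorithms instead.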
