Principal components analysis in the space of phylogenetic trees

Tom M. W. Nye

arXiv:1202.5132·math.ST·February 24, 2012

TL;DR

This paper introduces a novel geometrical method for applying Principal Components Analysis to collections of phylogenetic trees, enabling the analysis of variation in tree topology and branch lengths.

Contribution

It develops a new approach to PCA in tree-space based on geodesic paths, addressing the fact that the space of phylogenetic trees on a fixed set of taxa is not a vector space.

Findings

01 Identifies principal paths that explain the main variation in tree data

02 Demonstrates the method on simulated and real gene trees

03 Reveals sources of variation in topology and branch lengths

Abstract

Phylogenetic analysis of DNA or other data commonly gives rise to a collection or sample of inferred evolutionary trees. Principal Components Analysis (PCA) cannot be applied directly to collections of trees since the space of evolutionary trees on a fixed set of taxa is not a vector space. This paper describes a novel geometrical approach to PCA in tree-space that constructs the first principal path in an analogous way to standard linear Euclidean PCA. Given a data set of phylogenetic trees, a geodesic principal path is sought that maximizes the variance of the data under a form of projection onto the path. Due to the high dimensionality of tree-space and the nonlinear nature of this problem, the computational complexity is potentially very high, so approximate optimization algorithms are used to search for the optimal path. Principal paths identified in this way reveal and quantify…
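The abstract frames the first principal path by analogy with linear Euclidean PCA: pick the path that maximizes the variance of the data after projection onto it. The sketch below illustrates only that Euclidean analogue in the plane, not the tree-space method itself. It assumes 2D points, straight lines through the centroid in place of geodesics, and a plain grid search over directions standing in for the approximate optimization the paper mentions. The function names `projected_variance` and `first_principal_line` and the sample data are hypothetical, chosen for illustration.

```python
import math


def projected_variance(points, angle):
    """Variance of the orthogonal projections of 2D points onto the line
    through their centroid with direction (cos(angle), sin(angle))."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    ux, uy = math.cos(angle), math.sin(angle)
    # Signed position of each projected point along the line.
    scores = [(x - cx) * ux + (y - cy) * uy for x, y in points]
    # The scores have zero mean because the points were centred first.
    return sum(s * s for s in scores) / n


def first_principal_line(points, steps=3600):
    """Grid search over directions in [0, pi) for the line that maximizes
    the projected variance (an illustrative stand-in for the path search)."""
    candidates = (math.pi * k / steps for k in range(steps))
    best_angle = max(candidates, key=lambda a: projected_variance(points, a))
    return best_angle, projected_variance(points, best_angle)


# Made-up sample lying roughly along the line y = x.
points = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9), (4.0, 4.0)]
angle, variance = first_principal_line(points)
print(f"direction: {math.degrees(angle):.1f} degrees, projected variance: {variance:.3f}")
```

In the Euclidean case the optimal direction is available in closed form as the leading eigenvector of the covariance matrix. The search is used here only because, as the abstract notes, tree-space is not a vector space, so that shortcut is unavailable there and the optimal path has to be searched for.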
