The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples
Steven N. Evans, Frederick A. Matsen

TL;DR
This paper introduces a phylogenetic metric based on the Kantorovich-Rubinstein distance to compare microbial communities, providing a rigorous foundation, computational methods, and statistical testing procedures.
Contribution
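The paper shows that the weighted UniFrac distance coincides with the KR distance between empirical distributions on a reference phylogenetic tree, extends the metric to account for uncertainty in the location of sample points and to $L^p$ generalizations, and develops permutation tests whose p-values are approximated using Gaussian processes.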
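For orientation, the tree form of the distance can be sketched as below. The notation is an assumption made for this summary and is not quoted from the page: $T$ is the reference tree with length measure $\lambda$, $\tau(y)$ is the part of the tree on the far side of the point $y$ from the root, $P$ and $Q$ are the two samples' distributions on $T$, and $p \ge 1$.

```latex
Z_p(P, Q) = \left[ \int_T \bigl| P(\tau(y)) - Q(\tau(y)) \bigr|^p \, \lambda(dy) \right]^{1/p},
\qquad
Z_1(P, Q) = \int_T \bigl| P(\tau(y)) - Q(\tau(y)) \bigr| \, \lambda(dy).
```

With $p = 1$ this is the KR distance. If all mass sits at nodes of the tree, the integrand is constant along each edge, so the integral reduces to a sum over edges of the edge length times the $p$-th power of the absolute difference in mass below that edge.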
Findings
Weighted UniFrac equals the KR distance between empirical distributions.
Extensions incorporate uncertainty in the location of sample points on the tree and generalize the distance to $L^p$ metrics.
Permutation p-values can be approximated via Gaussian process functionals.
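The findings above can be illustrated with a minimal Python sketch. It assumes a simplified setting that is not taken from the paper: every read is placed exactly on a node of a rooted tree, the tree is encoded as a hypothetical parent array in which each node's parent has a smaller index, and the function names (`kr_distance`, `masses`, `permutation_p_value`) are invented for this example. The p-value is estimated by plain Monte Carlo label shuffling, not by the Gaussian process approximation the paper develops.

```python
import random


def kr_distance(parent, length, p_mass, q_mass, p=1.0):
    """Z_p distance (p >= 1) between two mass distributions on a rooted tree.

    parent[i] -- index of node i's parent; node 0 is the root (parent -1).
                 Assumption of this sketch: parent[i] < i for every i > 0.
    length[i] -- length of the edge joining node i to its parent.
    p_mass[i], q_mass[i] -- probability mass sitting at node i in each sample.
    """
    n = len(parent)
    diff = [p_mass[i] - q_mass[i] for i in range(n)]
    total = 0.0
    # Visit children before parents, so diff[i] is the net mass below edge i.
    for i in range(n - 1, 0, -1):
        total += length[i] * abs(diff[i]) ** p
        diff[parent[i]] += diff[i]
    return total ** (1.0 / p)


def masses(reads, n_nodes):
    """Empirical distribution of a sample given as a list of node indices."""
    m = [0.0] * n_nodes
    for node in reads:
        m[node] += 1.0 / len(reads)
    return m


def permutation_p_value(parent, length, reads_a, reads_b, n_perm=999, seed=0):
    """Monte Carlo permutation p-value for the observed KR distance."""
    rng = random.Random(seed)
    n = len(parent)
    observed = kr_distance(parent, length, masses(reads_a, n), masses(reads_b, n))
    pooled = list(reads_a) + list(reads_b)
    hits = 0
    for _ in range(n_perm):
        # Reassign reads to the two samples at random, keeping sample sizes.
        rng.shuffle(pooled)
        a, b = pooled[:len(reads_a)], pooled[len(reads_a):]
        d = kr_distance(parent, length, masses(a, n), masses(b, n))
        if d >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy tree: root 0; nodes 1 and 2 are children of 0; node 3 is a child of 1.
parent = [-1, 0, 0, 1]
length = [0.0, 1.0, 2.0, 0.5]
d = kr_distance(parent, length, [0, 0, 0, 1], [0, 0, 1, 0])
p_val = permutation_p_value(parent, length, [3] * 5, [2] * 5)
```

In the toy tree, one sample puts all its mass at node 3 and the other at node 2, so with $p = 1$ the distance is the path length between those nodes, 0.5 + 1.0 + 2.0. Shuffling the pooled reads while keeping the two sample sizes fixed is the standard permutation scheme; the Gaussian process approach described in the paper is a way to approximate the resulting p-value without enumerating or sampling permutations.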
Abstract
Using modern technology, it is now common to survey microbial communities by sequencing DNA or RNA extracted in bulk from a given environment. Comparative methods are needed that indicate the extent to which two communities differ given data sets of this type. UniFrac, a method built around a somewhat ad hoc phylogenetics-based distance between two communities, is one of the most commonly used tools for these analyses. We provide a foundation for such methods by establishing that if one equates a metagenomic sample with its empirical distribution on a reference phylogenetic tree, then the weighted UniFrac distance between two samples is just the classical Kantorovich-Rubinstein (KR) distance between the corresponding empirical distributions. We demonstrate that this KR distance and extensions of it that arise from incorporating uncertainty in the location of sample points can be written…
Taxonomy
Topics: Bayesian Methods and Mixture Models · Genomics and Phylogenetic Studies · Gene expression and cancer classification
