Multi-sample estimation of centered log-ratio matrix in microbiome studies
Yezheng Li, Hongzhe Li, Yuanpei Cao

TL;DR
This paper introduces a multi-sample estimation method for the centered log-ratio matrix in microbiome studies, addressing zero counts and improving compositional analysis accuracy.
Contribution
It proposes a regularized maximum likelihood estimator with nuclear norm penalty for clr matrices, leveraging low-rank structure across multiple samples.
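The summary above does not spell out how the estimator is computed, so the following is only a generic illustration of how a nuclear norm penalty encourages low rank. It shows singular value soft-thresholding, the standard proximal operator of the nuclear norm, applied to a noisy low-rank matrix. This is not the authors' estimator or their likelihood. The function name `singular_value_threshold`, the threshold `tau`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def singular_value_threshold(matrix, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold the singular values.

    Hypothetical helper for illustration; not taken from the paper.
    """
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)  # small singular values become exactly zero
    return (u * s_shrunk) @ vt           # rebuild the matrix from the shrunken spectrum

# Synthetic example (assumed data): a rank-2 matrix plus small Gaussian noise.
rng = np.random.default_rng(0)
low_rank = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 15))
noisy = low_rank + 0.05 * rng.normal(size=(20, 15))

denoised = singular_value_threshold(noisy, tau=1.0)

# Nuclear norm (sum of singular values) before and after shrinkage.
nuc_before = np.linalg.norm(noisy, ord="nuc")
nuc_after = np.linalg.norm(denoised, ord="nuc")
```

In a penalized likelihood setting, a step like this would typically alternate with a gradient step on the likelihood term; the choice of `tau` controls how aggressively the rank is reduced.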
Findings
Outperforms naive estimators in simulations
Effective on real microbiome datasets
Provides theoretical error bounds
Abstract
In microbiome studies, one of the ways of studying bacterial abundances is to estimate bacterial composition based on the sequencing read counts. Various transformations are then applied to such compositional data for downstream statistical analysis, among which the centered log-ratio (clr) transformation is most commonly used. Due to limited sequencing depth and DNA dropouts, many rare bacterial taxa might not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation using count normalization leads to many zero proportions, which makes clr transformation infeasible. This paper proposes a multi-sample approach to estimation of the clr matrix directly in order to borrow information across samples and across species. Empirical results from real datasets suggest that the clr matrix over multiple samples is approximately low rank, which…
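As a minimal sketch of why zero counts are a problem, the example below applies the clr transformation (the log of each proportion minus the mean log proportion within a sample) to naively normalized counts. The count values are made up, and placing samples in rows and taxa in columns is an assumption, not something the abstract specifies.

```python
import numpy as np

def clr(composition):
    """Centered log-ratio: log of each part minus the mean log over parts (per row)."""
    log_x = np.log(composition)
    return log_x - log_x.mean(axis=-1, keepdims=True)

# Made-up read counts: rows are samples, columns are taxa (assumed layout).
counts = np.array([[10.0, 30.0, 60.0],
                   [5.0, 0.0, 95.0]])   # second sample has a zero count

# Naive composition estimate: normalize each sample's counts to proportions.
composition = counts / counts.sum(axis=1, keepdims=True)

# Suppress the warnings that log(0) and inf - inf would otherwise raise.
with np.errstate(divide="ignore", invalid="ignore"):
    clr_matrix = clr(composition)

first_row_ok = np.isfinite(clr_matrix[0]).all()    # sample without zeros is well defined
second_row_ok = np.isfinite(clr_matrix[1]).any()   # sample with a zero has no finite entries
```

A single zero proportion makes the mean log proportion for that sample negative infinity, so every entry in that sample's clr row becomes non-finite, not just the entry for the missing taxon.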
Taxonomy
Topics: Gut microbiota and health · Geochemistry and Geologic Mapping · Bayesian Methods and Mixture Models
