# TphPMF: A microbiome data imputation method using hierarchical Bayesian Probabilistic Matrix Factorization

**Authors:** Xinyu Han, Kai Song

PMC · DOI: 10.1371/journal.pcbi.1012858 · 2025-03-11

## TL;DR

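TphPMF is a Bayesian probabilistic matrix factorization method for microbiome data imputation that builds its priors from phylogenetic relationships among taxa. It recovers missing taxon abundances more accurately than existing imputation methods and improves downstream differential abundance analysis and disease prediction.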

## Contribution

TphPMF introduces a novel Bayesian probabilistic matrix factorization approach that incorporates phylogenetic relationships to improve microbiome data imputation.

## Key findings

- TphPMF outperforms existing methods in recovering missing taxon abundances in microbiome data.
- TphPMF improves detection of differentially abundant taxa when used with DESeq2-phyloseq.
- TphPMF improves the accuracy of cross-predicting disease conditions in microbiome datasets for type 2 diabetes and colorectal cancer.

## Abstract

In microbiome research, data sparsity represents a prevalent and formidable challenge. Sparse data not only compromises the accuracy of statistical analyses but also conceals critical biological relationships, thereby undermining the reliability of the conclusions. To tackle this issue, we introduce a machine learning approach for microbiome data imputation, termed TphPMF. This technique leverages Probabilistic Matrix Factorization, incorporating phylogenetic relationships among microorganisms to establish Bayesian prior distributions. These priors facilitate posterior predictions of potential non-biological zeros. We demonstrate that TphPMF outperforms existing microbiome data imputation methods in accurately recovering missing taxon abundances. Furthermore, TphPMF enhances the efficacy of certain differential abundance analysis methods in detecting differentially abundant (DA) taxa, particularly showing advantages when used in conjunction with DESeq2-phyloseq. Additionally, TphPMF significantly improves the precision of cross-predicting disease conditions in microbiome datasets pertaining to type 2 diabetes and colorectal cancer.

Data sparsity is a significant challenge in microbiome research, as it compromises the accuracy of analyses and obscures important biological relationships. To address this issue, we developed a novel machine learning method called TphPMF, which stands for Phylogenetic Probabilistic Matrix Factorization. This method improves data imputation by incorporating phylogenetic relationships among microorganisms into a probabilistic matrix factorization framework, allowing for more accurate predictions of missing data. Our results demonstrate that TphPMF significantly outperforms existing techniques in recovering missing taxon abundances. Additionally, it enhances the detection of differentially abundant taxa, particularly when used in conjunction with DESeq2-phyloseq, a common differential abundance analysis tool. Moreover, TphPMF substantially improves the accuracy of predicting disease conditions in microbiome datasets related to type 2 diabetes and colorectal cancer. By effectively addressing data sparsity, TphPMF uncovers hidden biological relationships and bolsters the reliability of microbiome analyses. This advancement not only enhances our understanding of microbial communities but also has significant implications for disease prediction and personalized medicine, offering a robust tool for future microbiome research and clinical applications.
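To make the idea concrete, the following is a minimal sketch of matrix-factorization imputation with a group-level prior. It is not the authors' implementation. It assumes that every zero is treated as missing (the paper targets only potential non-biological zeros), that a flat group label per taxon (for example a genus) stands in for the phylogenetic hierarchy, and that a MAP estimate by gradient descent replaces full Bayesian posterior inference. The function name `impute_pmf` and all of its parameters are hypothetical.

```python
import numpy as np

def impute_pmf(counts, groups, rank=2, lam_u=0.1, lam_v=1.0,
               lr=0.01, n_iter=2000, seed=0):
    """Illustrative MAP matrix factorization on log abundances.

    counts : (n_samples, n_taxa) array; zeros are treated as missing (assumption).
    groups : length n_taxa labels; taxa sharing a label are shrunk toward a
             common latent mean (a stand-in for a phylogenetic prior).
    """
    counts = np.asarray(counts, dtype=float)
    groups = np.asarray(groups)
    observed = counts > 0

    # Work on the log scale and centre by the mean of the observed entries.
    log_obs = np.zeros_like(counts)
    log_obs[observed] = np.log(counts[observed])
    mu = log_obs[observed].mean()

    rng = np.random.default_rng(seed)
    n_samples, n_taxa = counts.shape
    U = 0.1 * rng.standard_normal((n_samples, rank))  # sample factors
    V = 0.1 * rng.standard_normal((n_taxa, rank))     # taxon factors

    for _ in range(n_iter):
        # Residuals are computed on observed entries only.
        resid = observed * (U @ V.T + mu - log_obs)

        # Mean latent vector of each taxon's group.
        group_mean = np.zeros_like(V)
        for g in np.unique(groups):
            members = groups == g
            group_mean[members] = V[members].mean(axis=0)

        # Gaussian prior on U (ridge) and group-centred Gaussian prior on V.
        grad_U = resid @ V + lam_u * U
        grad_V = resid.T @ U + lam_v * (V - group_mean)
        U -= lr * grad_U
        V -= lr * grad_V

    # Keep observed values; fill only the zero entries with predictions.
    imputed = counts.copy()
    imputed[~observed] = np.exp((U @ V.T + mu)[~observed])
    return imputed

# Toy data: 6 samples x 4 taxa, where taxa 0-1 and taxa 2-3 share a group.
counts = np.array([[10, 12, 0, 3],
                   [20,  0, 5, 6],
                   [ 0,  9, 2, 2],
                   [15, 14, 4, 0],
                   [30, 28, 0, 7],
                   [12,  0, 3, 3]])
imputed = impute_pmf(counts, groups=[0, 0, 1, 1])
print(np.round(imputed, 1))
```

In this sketch the group-centred penalty on `V` is what lets related taxa borrow strength from each other: a taxon with few observed entries is pulled toward the latent profile of its group rather than toward zero. A full hierarchical Bayesian treatment would place priors at several taxonomic levels and sample from the posterior instead of taking a point estimate.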

## Linked entities

- **Diseases:** type 2 diabetes (MONDO:0005148), colorectal cancer (MONDO:0005575)

## Full-text entities

- **Diseases:** colorectal cancer (MESH:D015179), type 2 diabetes (MESH:D003924)

## Figures

50 figures with captions in the complete paper: https://tomesphere.com/paper/PMC11957397/full.md

---
Source: https://tomesphere.com/paper/PMC11957397