# A negative binomial latent factor model for paired microbiome sequencing data

**Authors:** Hyotae Kim, Nazema Y. Siddiqui, Lisa Karstens, Li Ma

PMC · DOI: 10.1186/s12859-025-06362-3 · BMC Bioinformatics · 2026-01-22

## TL;DR

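This paper introduces a negative binomial latent factor model for jointly analyzing paired microbiome sequencing data from multiple body sites. Shared latent factors capture cross-site dependence, and the model can predict microbial abundance at one site from observations at another.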

## Contribution

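A novel latent factor model that jointly analyzes paired microbiome data, captures cross-site dependencies through latent factors shared across sites, and, in an extended form, clusters subjects with cluster-specific levels of paired association.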

## Key findings

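- In simulations, ignoring cross-site dependence reduces regression efficiency, and the loss grows as the association between the two sites becomes stronger.
- In a case study of the female urogenital microbiome with aging, the joint model detects covariate associations in the vaginal and urine microbiomes that are not statistically significant when a similar regression model is applied to the two sites separately.
- The model improves predictive performance by allowing microbial abundance at one site to be predicted from observations at another.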

## Abstract

Microbiome sequencing data are often collected from several body sites and exhibit dependencies. Our objective is to develop a model that enables joint analysis of data from different sites by capturing the underlying cross-site dependencies. The proposed model incorporates (i) latent factors shared across sites to explain common subject effects and to serve as the source of correlation between the sites and (ii) mixtures of latent factors to allow heterogeneity among the subjects in cross-site associations.
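The two components (i) and (ii) can be made concrete with a generic formulation. This is a sketch only: the notation below ($y$, $\mu$, $\phi$, $\beta$, $\lambda$, $\eta$, $\pi_k$, $m_k$, $\Sigma_k$) is assumed for illustration and is not taken from the paper, whose exact parameterization is given in the full text. For subject $i$, taxon $j$, and site $s$:

$$
\begin{aligned}
y_{ij}^{(s)} \mid \mu_{ij}^{(s)} &\sim \mathrm{NB}\big(\mu_{ij}^{(s)},\, \phi_j^{(s)}\big), \\
\log \mu_{ij}^{(s)} &= x_i^\top \beta_j^{(s)} + \lambda_j^{(s)\top} \eta_i, \\
\eta_i &\sim \sum_{k=1}^{K} \pi_k \, \mathcal{N}(m_k, \Sigma_k).
\end{aligned}
$$

In this sketch the latent factor $\eta_i$ carries no site index, so it enters the mean at every site and induces correlation between sites, which corresponds to component (i). Drawing $\eta_i$ from a mixture rather than a single Gaussian lets groups of subjects differ in their cross-site association, which corresponds to component (ii).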

Our simulation studies demonstrate that stronger associations between two sites lead to greater efficiency loss in regression analysis when such dependence is ignored in modeling. In a case study involving samples collected from a study on the female urogenital microbiome with aging, our model leads to the detection of covariate associations of the vaginal and urine microbiomes that are otherwise not statistically significant under a similar regression model applied to the two sites separately.

We propose a latent factor model for microbiome sequencing data collected from multiple sites. It captures the presumptive underlying cross-site associations without compromising estimation accuracy or inference efficiency in the absence of such associations. In addition, our proposed model improves predictive performance by enabling the prediction of microbial abundance at one site based on observations from another. We also provide an extended framework that allows for clustering of subjects (samples) and cluster-specific levels of paired association. Under this extended framework, clusters can be classified according to their association strengths.
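The idea that a shared latent factor induces cross-site correlation, and that subject clusters can be ordered by association strength, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' model or inference procedure: the function names, the single scalar factor, the site baselines (3.0 and 2.5), the dispersion (`size=5.0`), and the two cluster loadings are all hypothetical. It uses only the Python standard library.

```python
import math
import random


def sample_poisson(rng, lam):
    """Poisson(lam) via Knuth's method, applied in chunks of at most 30
    (a sum of independent Poissons is Poisson) to avoid underflow."""
    total = 0
    while lam > 0:
        step = min(lam, 30.0)
        limit, prod = math.exp(-step), rng.random()
        while prod > limit:
            total += 1
            prod *= rng.random()
        lam -= step
    return total


def sample_neg_binomial(rng, mean, size):
    """Negative binomial as a gamma-Poisson mixture with the given mean."""
    return sample_poisson(rng, rng.gammavariate(size, mean / size))


def simulate_pairs(n_subjects, loading, seed, size=5.0):
    """Paired counts (site A, site B) sharing one latent factor per subject."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_subjects):
        z = rng.gauss(0.0, 1.0)  # shared subject-level latent factor
        mean_a = math.exp(3.0 + loading * z)  # hypothetical site baselines
        mean_b = math.exp(2.5 + loading * z)
        pairs.append((sample_neg_binomial(rng, mean_a, size),
                      sample_neg_binomial(rng, mean_b, size)))
    return pairs


def log_correlation(pairs):
    """Pearson correlation of log(1 + count) between the two sites."""
    xs = [math.log1p(a) for a, _ in pairs]
    ys = [math.log1p(b) for _, b in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Two hypothetical subject clusters with different association strengths.
cluster_loadings = {"weak": 0.0, "strong": 1.0}
cluster_corr = {
    name: log_correlation(simulate_pairs(2000, loading, seed=1))
    for name, loading in cluster_loadings.items()
}
ranked = sorted(cluster_corr, key=cluster_corr.get)  # weakest to strongest
print(cluster_corr, ranked)
```

With a loading of zero the two sites are independent given the baselines, so their correlation should be near zero. A nonzero loading makes counts at one site informative about the other, which is the property that makes cross-site prediction possible.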

## Full-text entities

- **Diseases:** JNBM (MESH:D004195), SNBM (MESH:D001010), JNBM_Mix (MESH:D060085), diabetes (MESH:D003920), OAB (MESH:D053201)
- **Chemicals:** GMM (-)
- **Species:** Pseudomonas (RNA similarity group I, genus) [taxon 286], Anaerococcus (genus) [taxon 165779], Klebsiella (genus) [taxon 570], Escherichia coli (E. coli, species) [taxon 562], Streptococcus (genus) [taxon 1301], Corynebacterium (genus) [taxon 1716], Bifidobacterium (genus) [taxon 1678], Lactobacillus (genus) [taxon 1578], Homo sapiens (human, species) [taxon 9606], Gardnerella (genus) [taxon 2701], Aerococcus (genus) [taxon 1375], Prevotella (genus) [taxon 838]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12910815/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12910815/full.md

## References

1 reference — full list in the complete paper: https://tomesphere.com/paper/PMC12910815/full.md

---
Source: https://tomesphere.com/paper/PMC12910815