Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction
Younhun Kim, Georg K. Gerber, Travis E. Gibson

TL;DR
This paper introduces set-aggregated genome embeddings (SAGE), which use genomic language models to predict community-level abundance profiles of microbiomes and generalize better to novel genomes than classical bioinformatics approaches.
Contribution
The work presents a novel set-aggregation approach using genomic language models for microbiome abundance prediction, enhancing generalization and interpretability.
Findings
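SAGE generalizes better to novel genomes than classical bioinformatics approaches.
Model ablation shows that community-level latent representations directly improve performance.
Intermediate transformations between latent representations are beneficial, and results differ across GLM embedding choices.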
Abstract
Microbiome functions are encoded within the genes of the community-wide metagenome. A natural question is whether properties of a microbial community can be predicted just from knowing the raw DNA sequences of its members. In this work, we employ set-aggregated genome embeddings (SAGE) to predict community-level abundance profiles, exploiting the few-shot learning capabilities of genomic language models (GLMs). We benchmark this approach and show improved generalization on novel genomes compared to classical bioinformatics approaches. Model ablation shows that community-level latent representations directly result in improved performance. Lastly, we demonstrate the benefits of intermediate transformations between latent representations and characterize the differences between GLM embedding choices.
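The general idea of aggregating a set of per-genome embeddings into a community-level representation can be illustrated with a minimal sketch. This is not the authors' architecture, which the abstract does not specify. It assumes mean pooling as the permutation-invariant aggregator, toy two-dimensional vectors in place of real GLM embeddings, and a hypothetical linear scoring head (`weights`) with a genome-by-community interaction term, followed by a softmax so the predicted abundances sum to one.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(embeddings: List[Vector]) -> Vector:
    """Permutation-invariant aggregation (assumed here): average the genome vectors."""
    if not embeddings:
        raise ValueError("community must contain at least one genome")
    dim = len(embeddings[0])
    return [sum(vec[i] for vec in embeddings) / len(embeddings) for i in range(dim)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax, so outputs are non-negative and sum to one."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_abundances(embeddings: List[Vector], weights: Vector) -> List[float]:
    """Hypothetical head: score each genome from its own embedding plus an
    interaction with the pooled community vector, then normalize."""
    community = mean_pool(embeddings)
    scores = [dot(weights, vec) + dot(vec, community) for vec in embeddings]
    return softmax(scores)


if __name__ == "__main__":
    # Toy stand-ins for GLM embeddings of three genomes in one community.
    toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    toy_weights = [0.5, -0.5]  # illustrative values, not learned parameters
    print(predict_abundances(toy_embeddings, toy_weights))
```

Because the pooled vector does not depend on genome order, reordering the input genomes only reorders the predicted abundances. That property is what makes a set-based aggregator a natural fit for communities of varying size and composition.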
