Set-Aggregated Genome Embeddings for Microbiome Abundance Prediction

Younhun Kim; Georg K. Gerber; Travis E. Gibson

arXiv:2605.12286 · q-bio.GN · May 13, 2026

TL;DR

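This paper introduces set-aggregated genome embeddings (SAGE), which use genomic language models (GLMs) to predict community-level microbiome abundance profiles from the raw DNA sequences of community members, and reports improved generalization to novel genomes over classical bioinformatics approaches.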

Contribution

The work presents a novel set-aggregation approach using genomic language models for microbiome abundance prediction, enhancing generalization and interpretability.

Findings

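01. SAGE generalizes better to novel genomes than classical bioinformatics approaches.

02. Model ablation shows that community-level latent representations directly improve performance.

03. Intermediate transformations between latent representations are beneficial, and results differ across GLM embedding choices.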

Abstract

Microbiome functions are encoded within the genes of the community-wide metagenome. A natural question is whether properties of a microbial community can be predicted just from knowing the raw DNA sequences of its members. In this work, we employ set-aggregated genome embeddings (SAGE) to predict community-level abundance profiles, exploiting the few-shot learning capabilities of genomic language models (GLMs). Benchmarking this approach shows improved generalization on novel genomes compared to classical bioinformatics approaches. Model ablation shows that community-level latent representations directly result in improved performance. Lastly, we demonstrate the benefits of intermediate transformations between latent representations and highlight the differences between GLM embedding choices.
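
To make the idea of set aggregation concrete, here is a minimal sketch in plain Python. It is not the authors' architecture, which the abstract does not specify. It assumes each genome has already been mapped to a fixed-length vector (random vectors stand in for GLM embeddings here), pools those vectors with an element-wise mean to get a community-level representation, and scores each member from its own embedding combined with that pooled context. The function names (`mean_pool`, `predict_abundances`) and the weight vectors `w_member` and `w_inter` are hypothetical and untrained, so the numbers are illustrative only.

```python
import math
import random


def mean_pool(embeddings):
    """Permutation-invariant set aggregation: element-wise mean over members."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / n for d in range(dim)]


def softmax(scores):
    """Normalize raw scores into positive values that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def predict_abundances(embeddings, w_member, w_inter):
    """Score each genome from its embedding and the pooled community vector.

    score_i = sum_d w_member[d] * e_i[d] + sum_d w_inter[d] * e_i[d] * context[d]
    The second term lets the community context change each member's score.
    """
    context = mean_pool(embeddings)
    scores = []
    for e in embeddings:
        own = sum(w * x for w, x in zip(w_member, e))
        interaction = sum(w * x * c for w, x, c in zip(w_inter, e, context))
        scores.append(own + interaction)
    return softmax(scores)


# Assumption: random vectors stand in for per-genome GLM embeddings.
rng = random.Random(0)
dim = 4
community = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
w_member = [rng.gauss(0, 1) for _ in range(dim)]  # hypothetical, untrained
w_inter = [rng.gauss(0, 1) for _ in range(dim)]   # hypothetical, untrained

abund = predict_abundances(community, w_member, w_inter)
print(abund)
```

Because the pooled vector does not depend on the order of the members, reordering the input community only reorders the predicted abundances. That is the property that lets a model of this kind accept communities of any size and composition, including genomes it has not seen before.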
