MetagenBERT: a Transformer-based Architecture using Foundational genomic Large Language Models for novel Metagenome Representation
Gaspar Roy, Eugeni Belda, Baptiste Hennecart, Yann Chevaleyre, Edi Prifti, Jean-Daniel Zucker

TL;DR
MetagenBERT introduces a Transformer-based framework that generates end-to-end metagenome embeddings directly from raw DNA sequences, improving disease prediction and enabling scalable, annotation-free analysis across diverse datasets.
Contribution
This work presents MetagenBERT, a novel Transformer architecture that produces metagenome representations from raw sequences, matching or outperforming traditional species abundance methods in AUC and demonstrating robustness and transferability.
Findings
Achieves competitive or superior AUC in disease prediction tasks.
Clustering remains effective with as little as 10% of reads.
Transfer learning retains predictive signals across cohorts.
Abstract
Metagenomic disease prediction commonly relies on species abundance tables derived from large, incomplete reference catalogs, constraining resolution and discarding valuable information contained in DNA reads. To overcome these limitations, we introduce MetagenBERT, a Transformer-based framework that produces end-to-end metagenome embeddings directly from raw DNA sequences, without taxonomic or functional annotations. Reads are embedded using foundational genomic language models (DNABERT2 and the microbiome-specialized DNABERTMS), then aggregated through a scalable clustering strategy based on FAISS-accelerated KMeans. Each metagenome is represented as a cluster abundance vector summarizing the distribution of its embedded reads. We evaluate this approach on five benchmark gut microbiome datasets (Cirrhosis, T2D, Obesity, IBD, CRC). MetagenBERT achieves competitive or superior AUC…
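The aggregation step described in the abstract (embedded reads summarized as a cluster abundance vector) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the centroids have already been fitted (for example by KMeans), uses plain Python with Euclidean distance in place of FAISS, and the names `nearest_centroid`, `cluster_abundance_vector`, `toy_centroids`, and `toy_reads` are hypothetical. The 2-D toy vectors stand in for real read embeddings, which would come from a genomic language model.

```python
from collections import Counter
from math import dist


def nearest_centroid(embedding, centroids):
    """Return the index of the centroid closest to `embedding` (Euclidean)."""
    return min(range(len(centroids)), key=lambda k: dist(embedding, centroids[k]))


def cluster_abundance_vector(read_embeddings, centroids):
    """Summarize one metagenome as the fraction of its reads in each cluster."""
    if not read_embeddings:
        raise ValueError("need at least one read embedding")
    # Count how many reads fall into each cluster (assumed pre-fitted centroids).
    counts = Counter(nearest_centroid(e, centroids) for e in read_embeddings)
    total = len(read_embeddings)
    # Normalize counts so the vector sums to 1, independent of sequencing depth.
    return [counts.get(k, 0) / total for k in range(len(centroids))]


# Toy data: 2-D stand-ins for read embeddings and two hypothetical centroids.
toy_centroids = [(0.0, 0.0), (10.0, 10.0)]
toy_reads = [(0.1, -0.2), (9.5, 10.3), (0.4, 0.1), (-0.3, 0.2)]

abundance = cluster_abundance_vector(toy_reads, toy_centroids)
print(abundance)  # [0.75, 0.25]
```

Three of the four toy reads lie closest to the first centroid, so the resulting vector is `[0.75, 0.25]`. A vector of this form, one entry per cluster, is the kind of fixed-length representation a downstream disease classifier could consume. At realistic scale, the nearest-centroid search would be delegated to an optimized library rather than a Python loop.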
Taxonomy
Topics: Gut microbiota and health · Genomics and Phylogenetic Studies · Epigenetics and DNA Methylation
