FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics
ChenRui Duan, Zelin Zang, Yongjie Xu, Hang He, Zihan Liu, Siyuan Li, Zijia Song, Ju-Sheng Zheng, Stan Z. Li

TL;DR
FGBERT is a pre-trained gene language model for metagenomics. It tokenizes genes through a protein-based representation and combines Masked Gene Modeling with triplet contrastive learning to capture gene functions and inter-gene relationships in complex metagenomic data.
Contribution
The paper introduces FGBERT, a pre-trained model that combines protein-based tokenization, Masked Gene Modeling (MGM), and Triplet Enhanced Metagenomic Contrastive Learning (TMC) to better capture gene context and function in metagenomics, surpassing existing methods.
Findings
FGBERT outperforms existing models across multiple metagenomic levels.
It effectively captures gene-function relationships in large datasets.
Case studies demonstrate biological relevance and functional recognition.
Abstract
Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on K-mer, which limits the capture of structurally and functionally relevant gene contexts. Moreover, these approaches struggle with encoding biologically meaningful genes and fail to address the One-to-Many and Many-to-One relationships inherent in metagenomic data. To overcome these challenges, we introduce FGBERT, a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware and structure-relevant tokenizer. FGBERT incorporates Masked Gene Modeling (MGM) to enhance the understanding of inter-gene contextual relationships and Triplet Enhanced Metagenomic Contrastive Learning (TMC) to elucidate gene sequence-function relationships.…
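The abstract does not spell out the TMC objective, so the following is only a minimal sketch of a standard triplet margin loss, the generic idea behind triplet-based contrastive learning. It assumes that an anchor gene embedding should sit closer to an embedding with the same function (positive) than to one with a different function (negative). The function names, the Euclidean distance, the margin value, and the toy 2-D embeddings are all illustrative assumptions, not the paper's formulation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (assumed form, not taken from the paper).

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin`; otherwise it grows linearly.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Hypothetical 2-D gene embeddings, for illustration only.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]    # assumed to share the anchor's function
negative = [-1.0, 0.0]   # assumed to have a different function

# Well-separated triplet: the margin is already satisfied.
loss_easy = triplet_margin_loss(anchor, positive, negative)

# Swapping the roles violates the margin and yields a positive loss.
loss_hard = triplet_margin_loss(anchor, negative, positive)

print(loss_easy, loss_hard)
```

In a training loop, a loss of this shape would be averaged over many triplets and minimized with respect to the encoder that produces the embeddings; how FGBERT samples its triplets is not stated in the text above.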
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Genetics, Bioinformatics, and Biomedical Research
Methods: Contrastive Learning
