TL;DR
NucEL introduces an ELECTRA-style pre-training framework for genomic sequences that uses single-nucleotide tokens and a discriminator-generator setup, achieving state-of-the-art results with improved efficiency and interpretability.
Contribution
This paper presents the first ELECTRA-style pre-training method for genomic data, utilizing single-nucleotide tokens and hybrid attention to enhance efficiency and biological interpretability.
Findings
Achieves state-of-the-art performance on multiple genomic tasks.
Outperforms MLM-based models and is competitive with larger models.
Provides biologically relevant motif insights through attention analysis.
Abstract
Pre-training large language models on genomic sequences is a powerful approach for learning biologically meaningful representations. Masked language modeling (MLM) methods, such as DNABERT and Nucleotide Transformer (NT), achieve strong performance but suffer from partial token supervision, pre-training/fine-tuning mismatches, and high computational costs. We introduce NucEL, the first ELECTRA-style pre-training framework for genomic foundation models, addressing these limitations. Using a discriminator to identify tokens altered by a generator, NucEL provides comprehensive token-level supervision across all sequence positions, improving efficiency over the partial supervision of MLM. Incorporating ModernBERT's hybrid local-global attention and flash attention, NucEL offers an optimized BERT architecture for genomic modeling. Unlike 6-mer tokenization, NucEL uses single-nucleotide…
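The replaced-token-detection objective described in the abstract can be illustrated with a minimal sketch. It is not NucEL's implementation: the uniform random resampling below merely stands in for a learned generator, the 15% corruption rate is an assumed value, and the names `tokenize` and `corrupt` are hypothetical. The sketch shows single-nucleotide tokenization and how a binary label is produced for every position, which is what the discriminator is trained to predict.

```python
import random

VOCAB = ["A", "C", "G", "T"]  # single-nucleotide vocabulary (special tokens omitted)

def tokenize(seq):
    """Single-nucleotide tokenization: one token per base."""
    return list(seq.upper())

def corrupt(tokens, rate=0.15, rng=None):
    """Stand-in for the generator (assumption: uniform resampling, not a trained model).

    Returns the corrupted tokens, a 0/1 label per position
    (1 = replaced, 0 = original), and the positions that were resampled.
    """
    rng = rng or random.Random(0)
    corrupted = list(tokens)
    n = max(1, round(rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n)
    for i in positions:
        corrupted[i] = rng.choice(VOCAB)
    # As in ELECTRA, a sampled token that matches the original counts as "original".
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels, positions

tokens = tokenize("acgtacgtacgtacgtacgt")
corrupted, labels, positions = corrupt(tokens)

# An MLM loss would cover only the selected positions;
# the discriminator loss covers every position in the sequence.
mlm_supervised = len(positions)
electra_supervised = len(labels)
```

In this toy setting an MLM objective would receive a training signal at `mlm_supervised` positions, whereas the discriminator receives one at all `electra_supervised` positions, which is the efficiency argument the abstract makes.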