NucEL: Single-Nucleotide ELECTRA-Style Genomic Pre-training for Efficient and Interpretable Representations

Ke Ding; Brian Parker; Jiayu Wen

arXiv:2508.13191 · q-bio.GN · August 20, 2025

TL;DR

NucEL introduces an ELECTRA-style pre-training framework for genomic sequences that uses single-nucleotide tokens and a discriminator-generator setup, achieving state-of-the-art results with improved efficiency and interpretability.

Contribution

This paper presents the first ELECTRA-style pre-training method for genomic data, utilizing single-nucleotide tokens and hybrid attention to enhance efficiency and biological interpretability.
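To make the tokenization contrast concrete, here is a minimal sketch of single-nucleotide tokens next to k-mer tokens. The function names and the `stride` parameter are illustrative assumptions, not taken from the paper; k-mer models differ in whether their windows overlap, so the stride is left configurable.

```python
def single_nucleotide_tokens(seq):
    """One token per base, so each token maps to exactly one sequence position."""
    return list(seq.upper())


def kmer_tokens(seq, k=6, stride=6):
    """k-mer tokens; stride == k gives non-overlapping windows, stride == 1 overlapping ones.

    Hypothetical helper for illustration only. Trailing bases that do not
    fill a whole window are dropped.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


if __name__ == "__main__":
    dna = "ACGTACGTACGT"
    print(single_nucleotide_tokens(dna))  # 12 tokens, one per base
    print(kmer_tokens(dna))               # 2 tokens, each covering 6 bases
```

With one token per base, any per-token signal such as an attention weight refers to a single position, whereas a 6-mer token spreads that signal over six bases.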

Findings

1. Achieves state-of-the-art performance on multiple genomic tasks.

2. Outperforms MLM-based models and rivals larger models.

3. Provides biologically relevant motif insights through attention analysis.

Abstract

Pre-training large language models on genomic sequences is a powerful approach for learning biologically meaningful representations. Masked language modeling (MLM) methods, such as DNABERT and Nucleotide Transformer (NT), achieve strong performance but suffer from partial token supervision, pre-training/fine-tuning mismatches, and high computational costs. We introduce NucEL, the first ELECTRA-style pre-training framework for genomic foundation models, addressing these limitations. Using a discriminator to identify tokens altered by a generator, NucEL provides comprehensive token-level supervision across all sequence positions, improving efficiency over the partial supervision of MLM. Incorporating ModernBERT's hybrid local-global attention and flash attention, NucEL offers an optimized BERT architecture for genomic modeling. Unlike 6-mer tokenization, NucEL uses single-nucleotide…
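The discriminator objective described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the generator is replaced by a uniform random sampler (in ELECTRA-style training it is a small masked language model), and `corrupt`, `mask_rate`, and the 15% default are hypothetical choices made for the example.

```python
import random

NUCLEOTIDES = "ACGT"


def corrupt(sequence, mask_rate=0.15, rng=None):
    """Return (corrupted_tokens, labels, n_sampled) for replaced-token detection.

    labels[i] is 1 if the token at position i differs from the original, else 0.
    n_sampled is the number of positions the stand-in generator was asked to fill,
    which is also the number of positions an MLM loss would supervise.
    """
    rng = rng or random.Random()
    tokens = list(sequence)
    corrupted = list(tokens)
    # Choose the positions the generator will fill in (assumed rate, not from the paper).
    n_sampled = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    for i in rng.sample(range(len(tokens)), n_sampled):
        # Stand-in generator: a uniform draw instead of a learned MLM prediction.
        corrupted[i] = rng.choice(NUCLEOTIDES)
    # As in ELECTRA, a sampled token that equals the original is labeled "original".
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels, n_sampled


if __name__ == "__main__":
    seq = "ACGTACGTACGTACGTACGT"
    corrupted, labels, n_sampled = corrupt(seq, rng=random.Random(0))
    print("".join(corrupted))
    print(labels)
    # The discriminator receives a label at every position; MLM only at sampled ones.
    print(f"discriminator targets: {len(labels)}, MLM targets: {n_sampled}")
```

The final print line shows the supervision gap the abstract refers to: the discriminator is trained on a binary label at all 20 positions, while a masked language model would receive a training signal only at the sampled positions.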
