In Search of Lost DNA Sequence Pretraining
Zhijiang Tang, Jiaxin Qi, Yan Cui, Jinli Ou, Yuhua Zheng, Jianqiang Huang

TL;DR
This paper critically examines DNA sequence pretraining, identifying overlooked issues and proposing guidelines and a standardized testbed to improve reproducibility and evaluation in genomic foundation models.
Contribution
The paper reveals three overlooked problems in DNA pretraining, offers principled guidelines to address them, and introduces a standardized testbed for rigorous evaluation.
Findings
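Identified three overlooked problems in existing methods: inappropriate downstream datasets, inherent flaws in the neighbor-masking strategy, and a lack of detailed discussion of vocabulary.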
Provided guidelines for dataset selection, task design, and vocabulary analysis.
Established a standardized testbed for reproducible DNA pretraining benchmarking.
Abstract
DNA sequence encoding is fundamental to gene function prediction, protein synthesis, and diverse downstream biological tasks. Despite the substantial progress achieved by large-scale DNA sequence pretraining, existing studies have overwhelmingly emphasized pretraining scale and custom downstream evaluation datasets, while neglecting some essential components of the pretraining paradigm. In this paper, we reveal three critical yet heretofore overlooked problems in DNA pretraining: inappropriate downstream datasets, inherent flaws in the neighbor-masking strategy, and the lack of detailed discussion on vocabulary. Therefore, we undertake comprehensive investigations and propose principled guidelines, including selection criteria for evaluation datasets, guiding task design, and in-depth vocabulary analysis. Extensive experiments validate the significance of our identified problems and…
