Improving Genomic Models via Task-Specific Self-Pretraining
Sohan Mupparapu, Parameswari Krishnamurthy, Ratish Puduppully

TL;DR
This paper proposes task-specific self-pretraining for DNA language models: pretraining on the unlabeled sequences of the downstream task itself. On the BEND benchmark, this matches or exceeds training from scratch under identical compute and avoids the cost of full human-genome pretraining, although genome-scale pretraining may still reach higher absolute performance.
Contribution
It introduces self-pretraining on task-specific, unlabeled data for DNA language models (DNALMs) and shows, on the BEND benchmark, that the resulting models match or exceed models trained from scratch under identical compute.
Findings
Self-pretraining matches or exceeds training from scratch under identical compute on the BEND benchmark.
Task-specific self-pretraining is more compute-efficient than pretraining on the full human genome.
Genome-scale pretraining may still offer higher absolute performance.
Abstract
Pretraining DNA language models (DNALMs) on the full human genome is resource-intensive, yet often considered necessary for strong downstream performance. Inspired by recent findings in NLP and long-context modeling, we explore an alternative: self-pretraining on task-specific, unlabeled data. Using the BEND benchmark, we show that DNALMs trained with self-pretraining match or exceed the performance of models trained from scratch under identical compute. While genome-scale pretraining may still offer higher absolute performance, task-specific self-pretraining provides a practical and compute-efficient strategy for building stronger supervised baselines.
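To make the idea of self-pretraining on task-specific, unlabeled data concrete, here is a minimal sketch of how such pretraining examples might be built. It assumes a masked-nucleotide objective, which the abstract does not specify; the mask symbol, the 15% mask rate, and the function names are all hypothetical and not taken from the paper.

```python
import random

MASK = "N"  # hypothetical mask symbol; the paper's tokenizer may differ


def mask_sequence(seq, mask_rate=0.15, rng=None):
    """Mask random positions of a DNA string.

    Returns (masked_seq, targets), where targets maps each masked
    position to the original base a model would have to predict.
    """
    if rng is None:
        rng = random.Random(0)
    masked = list(seq)
    targets = {}
    for i, base in enumerate(seq):
        if rng.random() < mask_rate:
            targets[i] = base
            masked[i] = MASK
    return "".join(masked), targets


def build_pretraining_examples(task_sequences, mask_rate=0.15, seed=0):
    """Turn the task's own input sequences (labels ignored) into
    masked-prediction examples for a self-pretraining stage."""
    rng = random.Random(seed)
    return [mask_sequence(s, mask_rate, rng) for s in task_sequences]


if __name__ == "__main__":
    # Toy stand-ins for the unlabeled inputs of a downstream task.
    examples = build_pretraining_examples(["ACGTACGTACGT", "TTGACCA"])
    for masked, targets in examples:
        print(masked, targets)
```

Under this assumption, a model would first be trained to recover the masked bases from these examples and then fine-tuned on the same sequences with their labels, so no data beyond the task's own inputs is needed.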
