Improving Genomic Models via Task-Specific Self-Pretraining

Sohan Mupparapu; Parameswari Krishnamurthy; Ratish Puduppully

arXiv:2506.17766 · q-bio.GN · June 24, 2025

TL;DR

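This paper proposes task-specific self-pretraining for DNA language models (DNALMs): pretraining on the unlabeled sequences of the downstream task instead of the full human genome. On the BEND benchmark, self-pretrained models match or exceed models trained from scratch under identical compute. Genome-scale pretraining may still reach higher absolute performance, but self-pretraining is a practical, compute-efficient way to build stronger supervised baselines.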

Contribution

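The paper applies self-pretraining on task-specific, unlabeled data to DNALMs and shows, on the BEND benchmark, that it matches or exceeds training from scratch at the same compute budget, without the cost of pretraining on the full human genome.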

Findings

01

Self-pretraining matches or exceeds the performance of training from scratch under identical compute on the BEND benchmark.

02

Task-specific self-pretraining is more compute-efficient than pretraining on the full human genome.

03

Genome-scale pretraining may still yield higher absolute performance.

Abstract

Pretraining DNA language models (DNALMs) on the full human genome is resource-intensive, yet often considered necessary for strong downstream performance. Inspired by recent findings in NLP and long-context modeling, we explore an alternative: self-pretraining on task-specific, unlabeled data. Using the BEND benchmark, we show that DNALMs trained with self-pretraining match or exceed the performance of models trained from scratch under identical compute. While genome-scale pretraining may still offer higher absolute performance, task-specific self-pretraining provides a practical and compute-efficient strategy for building stronger supervised baselines.
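
The idea of self-pretraining on task-specific, unlabeled data can be illustrated with a minimal sketch. The abstract does not state the pretraining objective, so the masked-nucleotide prediction used here is an assumption, as are the vocabulary, the 15% mask rate, the toy sequences, and the function names `mask_sequence` and `self_pretraining_corpus`. The sketch only shows how a labeled task dataset might be turned into unlabeled pretraining examples; it is not the authors' implementation.

```python
import random

# Hypothetical single-nucleotide vocabulary (an assumption, not from the paper).
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4, "[MASK]": 5}
IGNORE = -100  # label for positions that should not contribute to the loss


def self_pretraining_corpus(labeled_examples):
    """Drop the labels: the task's own sequences become the unlabeled corpus."""
    return [seq for seq, _label in labeled_examples]


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Return (inputs, labels) for one masked-prediction example.

    Masked positions hold the [MASK] id in `inputs` and the original
    token id in `labels`; all other positions are labeled IGNORE.
    """
    rng = random.Random(seed)
    ids = [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]
    inputs, labels = list(ids), [IGNORE] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() < mask_rate:
            inputs[i] = VOCAB["[MASK]"]
            labels[i] = tok
    return inputs, labels


# Toy labeled task data (sequence, label); purely illustrative.
task_data = [("ACGTACGTAC", 1), ("TTGACCGTAA", 0)]

# Stage 1: build masked examples from the task sequences alone.
corpus = self_pretraining_corpus(task_data)
pretrain_batch = [mask_sequence(s, seed=i) for i, s in enumerate(corpus)]

# Stage 2 (not shown): fine-tune the same model on `task_data` with its labels.
```

In this setup, no data beyond the downstream task's own sequences is needed for the pretraining stage, which is what makes the approach cheaper than genome-scale pretraining.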
