The Finetuner's Fallacy: When to Pretrain with Your Finetuning Data

Christina Baek; Ricardo Pio Monti; David Schwab; Amro Abbas; Rishabh Adiga; Cody Blakeney; Maximilian Böther; Paul Burstein; Aldo Gael Carranza; Alvin Deng; Parth Doshi; Vineeth Dorna; Alex Fang; Tony Jiang; Siddharth Joshi; Brett W. Larsen; Jason Chan Lee; Katherine L. Mentzer; Luke Merrick; Haakon Mongstad; Fan Pan; Anshuman Suri; Darren Teh; Jason Telanoff; Jack Urbanek; Zhengping Wang; Josh Wills; Haoli Yin; Aditi Raghunathan; J. Zico Kolter; Bogdan Gaza; Ari Morcos; Matthew Leavitt; Pratyush Maini

arXiv:2603.16177·cs.LG·March 24, 2026

TL;DR

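This paper introduces specialized pretraining (SPT): a small domain dataset, usually reserved for finetuning, is instead repeated throughout pretraining as a fraction of the total tokens. Compared with standard pretraining, SPT improves domain performance after finetuning, preserves general capabilities, and reduces the pretraining tokens needed to reach a given domain performance.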

Contribution

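The paper proposes SPT, a pretraining strategy that mixes repeated domain data into the pretraining corpus, and evaluates it on three specialized domains (ChemPile, MusicPile, and ProofPile). After finetuning, SPT models show better domain performance and better retention of general capabilities than models trained with standard pretraining.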

Findings

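01. SPT improves domain performance and preserves general capabilities after finetuning, compared to standard pretraining.

02. SPT reduces the pretraining tokens needed to reach a given domain performance by up to 1.75x.

03. SPT outperforms larger models on underrepresented domains: on domains far from web text, a 1B SPT model outperforms a 3B model.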

Abstract

Real-world model deployments demand strong performance on narrow domains where data is often scarce. Typically, practitioners finetune models to specialize them, but this risks overfitting to the domain and forgetting general knowledge. We study a simple strategy, specialized pretraining (SPT), where a small domain dataset, typically reserved for finetuning, is repeated starting from pretraining as a fraction of the total tokens. Across three specialized domains (ChemPile, MusicPile, and ProofPile), SPT improves domain performance and preserves general capabilities after finetuning compared to standard pretraining. In our experiments, SPT reduces the pretraining tokens needed to reach a given domain performance by up to 1.75x. These gains grow when the target domain is underrepresented in the pretraining corpus: on domains far from web text, a 1B SPT model outperforms a 3B standard…
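The mixing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`domain_epochs`, `mixed_stream`), the per-sample Bernoulli mixing rule, and all numbers in the demo are hypothetical, chosen only to show how a fixed domain fraction turns into repeated passes over a small dataset.

```python
import random
from typing import Iterator, Sequence, Tuple


def domain_epochs(total_tokens: int, domain_fraction: float, domain_tokens: int) -> float:
    """Passes over the domain set implied by a fixed mixing fraction.

    Hypothetical helper: if a fraction of the token budget is drawn from a
    small domain set, the set is repeated this many times.
    """
    if not 0.0 <= domain_fraction <= 1.0:
        raise ValueError("domain_fraction must be in [0, 1]")
    if domain_tokens <= 0:
        raise ValueError("domain_tokens must be positive")
    return total_tokens * domain_fraction / domain_tokens


def mixed_stream(
    general: Sequence[str],
    domain: Sequence[str],
    domain_fraction: float,
    num_samples: int,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source, document) pairs, cycling through each corpus in order.

    Assumption: each sample is drawn from the domain set with probability
    `domain_fraction`; the paper's actual schedule may differ.
    """
    rng = random.Random(seed)
    g = d = 0
    for _ in range(num_samples):
        if rng.random() < domain_fraction:
            # The domain set is small, so the index wraps around (repetition).
            yield "domain", domain[d % len(domain)]
            d += 1
        else:
            yield "general", general[g % len(general)]
            g += 1


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    print(domain_epochs(total_tokens=100_000_000_000,
                        domain_fraction=0.02,
                        domain_tokens=200_000_000))
    for source, doc in mixed_stream(["web-a", "web-b", "web-c"], ["chem-a"], 0.25, 8):
        print(source, doc)
```

Because the domain set is much smaller than the token budget, even a small fraction implies many repetitions, which is why the abstract describes the domain data as "repeated starting from pretraining".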

Taxonomy

Topics: Topic Modeling · Domain Adaptation and Few-Shot Learning · Mental Health via Writing