Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection

Louis Bethune; David Grangier; Dan Busbridge; Eleonora Gualdoni; Marco Cuturi; Pierre Ablin

arXiv:2502.06042 · cs.LG · May 28, 2025

Open Access

TL;DR

This paper derives scaling laws that quantify forgetting and overfitting during finetuning with pretraining data injection, and shows that injecting as little as 1% of pretraining data into the finetuning mixture can prevent forgetting.

Contribution

The paper introduces scaling laws that describe how the amount of target data, the model scale, and the fraction of injected pretraining data affect forgetting and overfitting during finetuning.

Findings

01 Injecting as little as 1% of pretraining data into the finetuning data mixture prevents forgetting.

02 Scaling laws quantify the trade-off between overfitting and forgetting.

03 Pretraining data injection improves finetuning efficiency.

Abstract

A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained model to perform unsupervised next-token prediction on data from that target domain. Finetuning presents two challenges: (i) if the amount of target data is limited, as in most practical applications, the model will quickly overfit, and (ii) the model will drift away from the original model, forgetting the pretraining data and the generic knowledge that comes with it. We aim to derive scaling laws that quantify these two phenomena for various target domains, amounts of available target data, and model scales. We measure the efficiency of injecting pretraining data into the finetuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our study is that injecting as little as 1% of pretraining data in the finetuning data mixture prevents…
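The injection of pretraining data into the finetuning mixture described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mixed_stream`, the parameter `p_pretrain`, and the per-example sampling scheme are assumptions made for illustration, and the toy string lists stand in for real tokenized datasets.

```python
import random
from itertools import islice
from typing import Iterator, Sequence, Tuple


def mixed_stream(
    target: Sequence[str],
    pretrain: Sequence[str],
    p_pretrain: float = 0.01,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (source, example) pairs for a finetuning data mixture.

    Hypothetical helper: each example is drawn from the pretraining
    set with probability p_pretrain, otherwise from the target set.
    """
    if not 0.0 <= p_pretrain <= 1.0:
        raise ValueError("p_pretrain must lie in [0, 1]")
    rng = random.Random(seed)  # local RNG so the stream is reproducible
    while True:
        if rng.random() < p_pretrain:
            yield "pretrain", rng.choice(pretrain)
        else:
            yield "target", rng.choice(target)


# Usage: draw a finite sample and check the realized mixture fraction.
sample = list(islice(mixed_stream(["t1", "t2"], ["p1", "p2"], 0.01), 100_000))
fraction = sum(1 for source, _ in sample if source == "pretrain") / len(sample)
print(f"fraction of pretraining examples: {fraction:.4f}")
```

Sampling per example keeps the mixture fraction independent of the two dataset sizes, so a small target set is revisited many times while the pretraining set is only lightly touched. Mixing at the batch or token level would be an equally plausible reading of the abstract.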

Taxonomy

Topics: Image Processing and 3D Reconstruction · Computer Graphics and Visualization Techniques · Neural Networks and Applications