Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting
Howard Chen, Noam Razin, Karthik Narasimhan, Danqi Chen

TL;DR
This paper investigates how on-policy data in reinforcement learning helps language models retain prior knowledge better than supervised fine-tuning, providing practical guidelines for reducing catastrophic forgetting.
Contribution
It identifies the mode-seeking nature of RL's on-policy data as a key factor in mitigating forgetting, offering insights for more efficient continual learning.
Findings
RL leads to less forgetting than SFT across multiple LM families and tasks.
On-policy data's mode-seeking property helps preserve prior knowledge.
Using approximately on-policy data can effectively reduce forgetting in practice.
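The "mode-seeking" label in these findings refers to an asymmetry between the two directions of the KL divergence. The following is a minimal numerical sketch of that asymmetry, not a reproduction of the paper's mixture analysis. The bimodal target, the two candidate fits, and every name in the code (for example `kl_on_grid`) are illustrative assumptions. It scores a broad candidate and a single-mode candidate under forward KL (target to candidate, the mode-covering direction) and under reverse KL (candidate to target, the mode-seeking direction).

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a 1D Gaussian evaluated at the points in x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def kl_on_grid(a, b, dx):
    """Riemann-sum approximation of KL(a || b) for densities sampled on a grid."""
    return float(np.sum(a * (np.log(a) - np.log(b))) * dx)


# Evaluation grid; wide enough that all densities below are tiny at the edges.
x = np.linspace(-15.0, 15.0, 6001)
dx = x[1] - x[0]

# Hypothetical bimodal target: two well-separated, equally weighted modes.
target = 0.5 * gaussian_pdf(x, -4.0, 1.0) + 0.5 * gaussian_pdf(x, 4.0, 1.0)

# Candidate 1: one broad Gaussian with the same mean and variance as the target.
covering = gaussian_pdf(x, 0.0, np.sqrt(17.0))
# Candidate 2: a narrow Gaussian that matches only the right-hand mode.
seeking = gaussian_pdf(x, 4.0, 1.0)

# Forward KL: expectation taken under the target.
forward = {
    "covering": kl_on_grid(target, covering, dx),
    "seeking": kl_on_grid(target, seeking, dx),
}
# Reverse KL: expectation taken under the candidate itself.
reverse = {
    "covering": kl_on_grid(covering, target, dx),
    "seeking": kl_on_grid(seeking, target, dx),
}

print("forward KL(target || candidate):", forward)
print("reverse KL(candidate || target):", reverse)
```

In this toy setup, forward KL ranks the broad candidate as the better fit, while reverse KL ranks the single-mode candidate as the better fit. A single Gaussian is the textbook case, and how the asymmetry plays out for a multi-modal policy is the question the paper's own analysis addresses.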
Abstract
Adapting language models (LMs) to new tasks via post-training carries the risk of degrading existing capabilities -- a phenomenon classically known as catastrophic forgetting. In this paper, toward identifying guidelines for mitigating this phenomenon, we systematically compare the forgetting patterns of two widely adopted post-training methods: supervised fine-tuning (SFT) and reinforcement learning (RL). Our experiments reveal a consistent trend across LM families (Llama, Qwen) and tasks (instruction following, general knowledge, and arithmetic reasoning): RL leads to less forgetting than SFT while achieving comparable or higher target task performance. To investigate the cause for this difference, we consider a simplified setting in which the LM is modeled as a mixture of two distributions, one corresponding to prior knowledge and the other to the target task. We identify that the…
Peer Reviews
Decision·Submitted to ICLR 2026
- The paper conducts both experimental evaluations and theoretical analyses of catastrophic forgetting in SFT and RL, and the conclusions are convincing.
- It clearly shows that the reason RL resists catastrophic forgetting lies in its on-policy nature of data, rather than KL regularization or advantage estimation, which is an observation of notable value.

- The practicality of Iterative-SFT may be limited for two reasons: 1) Since the policy model generates its own training data, the generated examples may not be sufficiently challenging compared to data produced by a stronger teacher model; 2) It requires a reward model or rule-based verification methods to score and filter the data; however, because the policy model itself may not be well-versed in the target domain, the proportion of high-quality samples could be low, placing high demands on t

- The paper articulates a concrete and important question, i.e., why RL fine-tuning forgets less than SFT, and provides extensive experimental results supporting the finding across architectures and datasets.
- The writing and figures are well-organized. In particular, the “gain–drop” metric and visualization (Figure 2) make results intuitive, and the toy Gaussian analysis offers a didactic explanation.
- The study disentangles potential confounders (KL regularization, advantage estimation) an

- The finding that on-policy learning mitigates forgetting better than off-policy learning has already been explored in both RL and alignment literature. Notably, “Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data” (Tajwar et al., 2024) also frames on-policy vs off-policy updates as mode-seeking vs mode-covering, drawing the same connection between reverse KL and improved retention. Thus, while this paper extends that reasoning to an explicit forgetting study, its concept

1. The paper provides a comprehensive and rigorous empirical evaluation across diverse tasks, making the findings highly robust and generalizable.
2. The authors offer an intuitive yet formal explanation for the observed phenomenon by modeling the policy as a mixture of distributions and linking forgetting behavior to the mode-seeking nature of reverse KL minimization.
3. The practical implication that approximately on-policy data can significantly reduce forgetting is a valuable and efficient a

1. Forgetting is measured via average accuracy drops; other forms of degradation (semantic drift, safety loss, calibration changes) are not quantitatively explored.
2. The experiments are limited to models of up to 8B parameters, and it is unclear whether the same trends hold for significantly larger or smaller models, limiting the scalability claims.
3. While the Gaussian mixture analogy provides valuable intuition, it may oversimplify the complex, high-dimensional, and often non-Gaussian natur
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Topic Modeling
