Mapping Post-Training Forgetting in Language Models at Scale
Jackson Harmon, Andreas Hochlehnert, Matthias Bethge, Ameya Prabhu

TL;DR
This paper introduces a sample-wise metric to measure and analyze how post-training affects pretrained knowledge in language models, revealing nuanced patterns of forgetting and backward transfer across different training stages and scales.
Contribution
It proposes a novel framework for quantifying forgetting and backward transfer at a granular level, providing insights into post-training effects on large language models.
Findings
Domain-continual pretraining causes moderate forgetting with low backward transfer.
RL/SFT and instruction tuning improve backward transfer on math and logic tasks.
Model merging does not effectively prevent forgetting.
Abstract
Scaled post-training now drives many of the largest capability gains in language models (LMs), yet its effect on pretrained knowledge remains poorly understood. Not all forgetting is equal: Forgetting one fact (e.g., a U.S. president or an API call) does not "average out" by recalling another. Hence, we propose a sample-wise paradigm to measure what is forgotten and when backward transfer occurs. Our metric counts 1->0 transitions (correct before post-training, incorrect after) to quantify forgetting and 0->1 transitions to quantify backward transfer. Traditional task averages conflate these effects and obscure large changes. For multiple-choice benchmarks, we add chance-adjusted variants that subtract the expected contribution of random guessing from pre- and post-training accuracies. We apply this framework across post-training stages, model sizes, and data scales. Our large-scale…
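The transition counting described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names, the normalization by the total number of samples, and the chance correction shown (the standard correction-for-guessing formula under uniform guessing over k options) are assumptions made for this example, and the paper's exact chance-adjusted definitions may differ.

```python
from typing import Dict, Sequence


def transition_rates(before: Sequence[int], after: Sequence[int]) -> Dict[str, float]:
    """Count sample-wise transitions between two evaluation runs.

    before[i] / after[i] are 1 if sample i was answered correctly
    before / after post-training, else 0.
    Assumption: rates are normalized by the total number of samples.
    """
    if len(before) != len(after) or len(before) == 0:
        raise ValueError("need two non-empty correctness lists of equal length")
    n = len(before)
    pairs = [(bool(b), bool(a)) for b, a in zip(before, after)]
    forgotten = sum(1 for b, a in pairs if b and not a)  # 1->0 transitions
    gained = sum(1 for b, a in pairs if not b and a)     # 0->1 transitions
    return {
        "forgetting": forgotten / n,
        "backward_transfer": gained / n,
        "accuracy_before": sum(b for b, _ in pairs) / n,
        "accuracy_after": sum(a for _, a in pairs) / n,
    }


def chance_adjusted_accuracy(accuracy: float, num_options: int) -> float:
    """Standard correction for guessing (an assumption, not the paper's formula).

    Assumes the model guesses uniformly among num_options choices whenever
    it does not know the answer; the result is clipped at 0.
    """
    if num_options < 2:
        raise ValueError("num_options must be at least 2")
    chance = 1.0 / num_options
    return max(0.0, (accuracy - chance) / (1.0 - chance))


if __name__ == "__main__":
    # Hypothetical correctness vectors for eight samples.
    before = [1, 1, 1, 0, 0, 0, 1, 0]
    after = [0, 1, 1, 1, 0, 0, 0, 1]
    print(transition_rates(before, after))
    print(chance_adjusted_accuracy(0.625, num_options=4))
```

In the hypothetical example, accuracy is 0.5 both before and after post-training, yet a quarter of the samples were forgotten and another quarter were newly answered correctly. This is the kind of change that a task average hides.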
Peer Reviews
Decision·ICLR 2026 Poster
- This paper introduces a sample-wise, chance-adjusted metric to precisely quantify forgetting and backward transfer, overcoming the limitations of aggregate accuracy measures.
- This paper conducts a broad empirical study across multiple model sizes, training regimes (SFT, RLHF, instruction tuning, continual pretraining), and datasets, providing a systematic view of post-training dynamics.
- This paper offers interesting findings: instruction tuning and reasoning post-training yield strong backw
- This study focuses mainly on MCQ datasets, which may not generalize to open-ended tasks or generative reasoning.
- The uniform random-guessing correction assumes independence and equal likelihood of options, which oversimplifies model behavior and may distort real forgetting dynamics.
- While correlations between data scale, post-training type, and forgetting are documented, the paper does not deeply analyze why certain stages (e.g., reasoning training) cause particular effects.
1. The sample-wise paradigm directly addresses the non-fungibility of pretrained knowledge (a long-overlooked limitation of task-averaged metrics), and chance-adjusted metrics rigorously correct for random guessing, an essential step for reliable evaluation of multiple-choice benchmarks (the dominant format for knowledge-intensive LM tests).
2. The methodology is both theoretically sound (e.g., explicit assumptions about uniform guessing and independent pre/post events) and practically feasible (n
1. The paper identifies which post-training regimes cause forgetting/transfer but rarely explains why. For example, it notes that "culture" is the most forgettable category across regimes (e.g., Llama-3.1-8B-Instruct has 18.9% forgetting in culture), but does not investigate whether this stems from cultural knowledge being less "entrenched" in pretraining or from post-training data mismatching cultural contexts. It finds reasoning training on base models outperforms instruction tuning in transfer, but does
1. Shifting the evaluation perspective from the macro-level average accuracy of the task to the micro-level of the sample is a highly insightful transformation. This allows us to understand more precisely the dynamic changes in the knowledge within the model, rather than simply seeing a vague final score.
2. The authors evaluated nearly 30 model-training combinations, covering the vast majority of mainstream post-training paths in the current LLM ecosystem. This work itself is a massive undert
1. The model's assumption of "uniform random guessing when unable to solve a problem, with independent guesses before and after" is somewhat idealistic. The paper could discuss robustness checks using multiple sampling/deterministic decoding, or question-level chance correction based on logits, to relax the independence and uniformity assumptions.
2. There are some shortcomings in the experiments. The current version reports a flipped variance/confidence interval, requiring significa
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education
