Unveiling Over-Memorization in Finetuning LLMs for Reasoning Tasks
Zhiwen Ruan, Yun Chen, Yutao Hou, Peng Li, Yang Liu, Guanhua Chen

TL;DR
This paper investigates the over-memorization phenomenon in large language models finetuned for reasoning tasks. It shows that the phenomenon is prevalent, examines its effect on robustness, and proposes mitigation strategies.
Contribution
It uncovers the over-memorization phenomenon during LLM finetuning, analyzes its causes, and offers techniques to mitigate its negative effects.
Findings
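Over-memorization arises during a specific stage of finetuning, where test perplexity is high while test accuracy remains good.
Over-memorized models show reduced robustness and poorer out-of-distribution generalization, despite comparable test accuracy.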
Proposed techniques help mitigate over-memorization effects.
Abstract
Pretrained large language models (LLMs) are finetuned with labeled data to improve instruction-following ability and alignment with human values. In this paper, we study the learning dynamics of LLM finetuning on reasoning tasks and reveal an over-memorization phenomenon that arises during a specific stage of finetuning. At this stage, the LLMs have excessively memorized the training data and exhibit high test perplexity while maintaining good test accuracy. We explore the conditions that contribute to over-memorization and find that the issue is prevalent across tasks, models, and finetuning methods, with prolonged training and large learning rates exacerbating the problem. Although over-memorized models reach test accuracy comparable to normal models, they suffer from reduced robustness, poor out-of-distribution generalization, and decreased generation…
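The signature described in the abstract (test perplexity climbing while test accuracy holds) can be checked mechanically over a series of checkpoints. The following is a minimal sketch, not the authors' procedure: the function names, the `ppl_ratio` and `acc_drop` thresholds, and the example numbers are all illustrative assumptions. Perplexity is computed in the standard way, as the exponential of the mean per-token negative log-likelihood.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))


def flag_over_memorization(history, ppl_ratio=1.5, acc_drop=0.01):
    """Flag checkpoints that look over-memorized.

    history: list of (test_accuracy, test_perplexity), one pair per checkpoint,
             in training order.
    A checkpoint is flagged when its perplexity exceeds ppl_ratio times the
    lowest perplexity seen so far while its accuracy is still within acc_drop
    of the best accuracy seen so far. Both thresholds are hypothetical knobs,
    not values taken from the paper.
    """
    flagged = []
    best_acc = float("-inf")
    min_ppl = float("inf")
    for i, (acc, ppl) in enumerate(history):
        best_acc = max(best_acc, acc)
        min_ppl = min(min_ppl, ppl)
        ppl_is_high = ppl > ppl_ratio * min_ppl
        acc_is_kept = acc >= best_acc - acc_drop
        if ppl_is_high and acc_is_kept:
            flagged.append(i)
    return flagged


# Made-up numbers for illustration only.
history = [
    (0.60, 3.0),
    (0.70, 2.5),
    (0.71, 2.6),
    (0.71, 4.5),   # perplexity up, accuracy flat
    (0.705, 6.0),  # perplexity up, accuracy roughly flat
    (0.50, 9.0),   # accuracy also falls: classical overfitting, not flagged
]
print(flag_over_memorization(history))
```

The last checkpoint is deliberately left unflagged: once accuracy falls together with rising perplexity, the behaviour matches classical overfitting rather than the over-memorization pattern the paper distinguishes from it.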
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper systematically uncovers the “high accuracy, rising perplexity” over-memorization phenomenon during fine-tuning, verifies its prevalence across multiple learning rates and methods, and clearly distinguishes it from traditional overfitting. 2. It conducts a comprehensive experimental investigation that quantifies over-memorization’s negative impact on robustness, out-of-distribution generalization, Best-of-N sampling, and privacy risk with rich metrics. 3. The paper proposes practic
1. Lack of deep theoretical understanding: The paper’s central weakness is that it remains overly empirical. Its theoretical analysis of why over-memorization occurs is insufficient, offering only a superficial explanation via cross-entropy mechanics (Sec. 5). It provides no mathematical framework to predict when over-memorization will arise or to quantify its severity beyond empirical observation (Eqs. 3–4). 2. Questionable evaluation setup: Given the broad scope of the topic, comprehensive ex
Clear empirical pattern: Multiple plots/tables show rising test PPL without collapsing accuracy, across tasks and models. Breadth: Results cover math QA, code, and scientific QA; also Gemma/Mistral. Practical takeaways: Concrete checkpoint-selection advice (balance val-ACC with val-PPL) and evidence that it matters. Robustness/OOD/Diversity analyses: Over-memorized checkpoints are more brittle to neutral prompt preambles and underperform on OOD. Lightweight mitigations: Checkpoint mergin
Causality vs. correlation. While learning rate and training time correlate with the effect, other confounds (batch size, data curriculum/order, decoding temperature during evaluation, regularization, LoRA rank, prompt templates) are not systematically ruled out. The methodological breadth is good, but ablations feel incomplete. Novelty positioning could be sharper. The paper claims to be the first to uncover this specific phenomenon; related work (e.g., learning-dynamics perspectives) is noted,
1. Practical training takeaway. The paper highlights a realistic failure mode in finetuning pipelines: validation perplexity may start to rise while task accuracy is still improving, so stopping purely on perplexity can prematurely discard useful checkpoints. This makes the work directly useful to practitioners who fine-tune reasoning-capable LMs. 2. Multi-faceted behavioral probing. Beyond reporting the metric divergence, the authors systematically examine its downstream effects on OOD generaliza
1. Marginal Novelty of the Core Phenomenon: The paper's central concept of "over-memorization" (defined as rising test perplexity while test accuracy remains stable) is insufficiently distinguished from classical overfitting. The authors define classical overfitting as rising perplexity and decreasing accuracy, but the behavior they identify is arguably just a minor variant of this, where the task-specific metric (accuracy) is less sensitive or lags behind the loss metric (perplexity). The claim t
1. The paper is well written, with a clear storyline. 2. The paper's conclusions are clear and supported by experiments.
1. **Limited novelty and generalizability of the main conclusion** The central claim—that supervised fine-tuning (SFT) tends to memorize [2]—has been extensively discussed in prior work. The paper’s contribution would be stronger if it offered a more nuanced or novel perspective on this phenomenon. More critically, the experimental setup undermines the generalizability of the findings. The authors repeatedly fine-tune on a small dataset (e.g., 100K examples) for many epochs (e.g., 10)
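The checkpoint-selection advice mentioned in the reviews above (balance validation accuracy with validation perplexity) could be sketched roughly as follows. This is an assumed selection rule for illustration, not the one used in the paper: the `select_checkpoint` helper, the `acc_tolerance` parameter, and the checkpoint records are all hypothetical.

```python
def select_checkpoint(checkpoints, acc_tolerance=0.005):
    """Pick a checkpoint by accuracy first, perplexity second.

    checkpoints: list of dicts with keys 'name', 'val_acc', 'val_ppl'.
    Among checkpoints whose validation accuracy is within acc_tolerance of the
    best one, return the checkpoint with the lowest validation perplexity.
    The tolerance is an illustrative assumption.
    """
    if not checkpoints:
        raise ValueError("no checkpoints to choose from")
    best_acc = max(c["val_acc"] for c in checkpoints)
    candidates = [c for c in checkpoints if c["val_acc"] >= best_acc - acc_tolerance]
    return min(candidates, key=lambda c: c["val_ppl"])


# Made-up validation numbers for illustration only.
checkpoints = [
    {"name": "step-1000", "val_acc": 0.700, "val_ppl": 2.5},
    {"name": "step-2000", "val_acc": 0.710, "val_ppl": 2.7},
    {"name": "step-3000", "val_acc": 0.712, "val_ppl": 5.9},
]
print(select_checkpoint(checkpoints)["name"])
```

Selecting on accuracy alone would return `step-3000` here; the tie-break on perplexity prefers `step-2000`, which is nearly as accurate but has much lower validation perplexity.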
Taxonomy
Topics: Semantic Web and Ontologies · Multi-Agent Systems and Negotiation · Natural Language Processing Techniques
