Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization
Xueyun Tian, Minghua Ma, Bingbing Xu, Nuoyan Lyu, Wei Li, Heng Dong, Zheng Chu, Yuanzhuo Wang, Huawei Shen

TL;DR
Incorporating negative reasoning trajectories into supervised fine-tuning significantly improves out-of-domain generalization in large language models by leveraging valid intermediate reasoning steps and adaptive loss weighting.
Contribution
This paper reveals the benefits of using negative reasoning samples in fine-tuning and introduces GLOW, a novel loss weighting scheme that enhances model generalization and exploration.
Findings
Negative trajectories improve OOD generalization.
GLOW achieves 5.51% OOD gain over positive-only training.
Model performance on MMLU increases from 72.82% to 76.47%.
Abstract
Supervised fine-tuning (SFT) on chain-of-thought (CoT) trajectory demonstrations is a common approach for enabling reasoning in large language models. Standard practice typically retains only trajectories with correct final answers (positives) while ignoring the rest (negatives). We argue that this paradigm discards substantial supervision and exacerbates overfitting, limiting out-of-domain (OOD) generalization. Specifically, we surprisingly find that incorporating negative trajectories into SFT yields substantial OOD generalization gains over positive-only training, as these trajectories often retain valid intermediate reasoning despite incorrect final answers. To understand this effect in depth, we systematically analyze data, training dynamics, and inference behavior, identifying 22 recurring patterns in negative chains that serve a dual role: they moderate loss descent to mitigate…
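The abstract is truncated before it describes GLOW, so the exact weighting rule is not available here. As a rough illustration of what loss-based reweighting over a mixed batch of positive and negative trajectories could look like, the sketch below rescales per-sample losses with a softmax over those losses. The function names, the `temperature` parameter, and the choice to up-weight higher-loss samples are all assumptions made for illustration and are not taken from the paper.

```python
import math


def loss_based_weights(losses, temperature=1.0):
    """Hypothetical weighting: softmax over per-sample losses, scaled to mean 1.

    This is NOT the GLOW formula (which is not given in the source); it is a
    generic loss-based reweighting used only to illustrate the idea.
    """
    if not losses:
        return []
    peak = max(losses)  # subtract the max for numerical stability
    exps = [math.exp((loss - peak) / temperature) for loss in losses]
    total = sum(exps)
    n = len(losses)
    # Multiply by n so that the weights average to 1 across the batch.
    return [n * e / total for e in exps]


def weighted_sft_loss(losses, temperature=1.0):
    """Batch objective: mean of weight_i * loss_i over all trajectories."""
    if not losses:
        raise ValueError("losses must be non-empty")
    weights = loss_based_weights(losses, temperature)
    return sum(w * loss for w, loss in zip(weights, losses)) / len(losses)


# Illustrative per-trajectory losses (made-up numbers, not from the paper).
# Positives (correct final answer) and negatives share one batch.
positive_losses = [0.20, 0.35]
negative_losses = [1.10, 0.90]
batch = positive_losses + negative_losses

weights = loss_based_weights(batch)
objective = weighted_sft_loss(batch)
```

With equal losses every weight is 1 and the objective reduces to the plain mean, so the unweighted SFT loss is recovered as a special case. In a real training loop the per-sample losses would come from the model's token-level cross-entropy averaged per trajectory.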
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
- Well-written
- Interesting observations regarding negative reasoning traces, and a relatively simple proposed fix via loss-based reweighting
- Promising preliminary empirical results
- Theoretical discussions and statements in the main text and Appendix A.5 should be cleaned up:
  - In Eqn. (7), the index $t$ is missing from the LHS, and $\ell_i^{(t)}$ should be just $\ell_i$? I guess that this is the training objective per epoch? So then $t$ is the epoch, not the individual (SGD) step?
  - What is $\delta$ in line 404?
  - Please utilize appropriate amsthm environments (lemma, theorem, etc.). Also, theorem statements should not contain parts of the proof.
  - (A4) is not reall
(1) This paper presents a novel perspective by treating incorrect reasoning trajectories as useful supervision. (2) The findings that incorrect reasoning samples are useful supervision are insightful, providing a potential direction for improving OOD generalization in LLMs. (3) The writing is well-structured and easy to follow. (4) Both empirical evidence and theoretical analysis are provided.
[1] The claim that this is “the first systematic study demonstrating that negative reasoning samples constitute valuable supervision” may be overstated, as similar claims have been made in prior work (e.g., [1]). [2] The experimental evaluation is limited in scope, as most results are based on Qwen models. It remains unclear whether GLOW would provide consistent improvements across other architectures such as GPT-OSS-20B or LLaMA. [1] Xinyu Zhu et al. The Surprising Effectiveness of Negative
1. The empirical finding is interesting. 2. The empirical analysis is interesting.
As the paper mentions, pre-training followed by post-training (SFT, RL) is the common paradigm. There is clear evidence that RL generalizes and SFT memorizes, e.g., “SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training”. Modern post-training of reasoning models never stops at SFT, but always includes the RL stage, as it dramatically improves performance. Given that it is possible that RL improves generalization, for this work to be practically impactful it would n
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
