Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability
Qihan Ren, Peng Wang, Ruikun Cai, Shuai Shao, Dadi Guo, Yuejin Xie, Yafu Li, Quanshi Zhang, Xia Hu, Jing Shao, Dongrui Liu

TL;DR
This paper reevaluates the generalization capabilities of reasoning supervised fine-tuning in large language models, emphasizing the roles of optimization, data quality, and model strength in cross-domain reasoning performance.
Contribution
It challenges the simplistic view that SFT memorizes and RL generalizes, showing that generalization depends on multiple factors including training dynamics, data, and model capability.
Findings
Cross-domain generalization is conditional and influenced by training dynamics.
Cross-domain performance can dip early in training, then recover and improve with longer training.
Higher model capability leads to better transfer of reasoning skills.
Abstract
A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking)…
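The dip-and-recovery pattern described in the abstract can be checked on any checkpoint-wise evaluation curve. The sketch below is one possible way to do that, not the authors' evaluation code: the function name `dip_and_recovery`, its return keys, and the example scores are all hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def dip_and_recovery(scores: Sequence[float], baseline: float) -> dict:
    """Summarize a checkpoint-wise score curve relative to a base-model score.

    scores[i] is the cross-domain score of the i-th saved checkpoint;
    baseline is the score of the model before fine-tuning.
    (Hypothetical helper, not taken from the paper.)
    """
    if not scores:
        raise ValueError("scores must be non-empty")

    # Index of the worst checkpoint on the curve.
    lowest = min(range(len(scores)), key=lambda i: scores[i])
    dipped = scores[lowest] < baseline

    # First checkpoint after the lowest point that is back at or above baseline.
    recovered_at: Optional[int] = next(
        (i for i in range(lowest + 1, len(scores)) if scores[i] >= baseline),
        None,
    )

    return {
        "dipped": dipped,
        "lowest_checkpoint": lowest,
        "recovered_at": recovered_at if dipped else None,
        "final_gain": scores[-1] - baseline,
    }


# Made-up numbers for illustration only; these are not results from the paper.
example = dip_and_recovery([35.0, 31.0, 38.0, 42.0, 45.0], baseline=40.0)
print(example)
```

With a curve like this, stopping at the second checkpoint would report a loss relative to the base model, while the final checkpoint shows a gain, which is why a short-training checkpoint can understate generalization.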
Code & Models
- 🤗 Jackrong/Qwen3.5-9B-DeepSeek-V4-Flash-GGUF · model · 440k dl · ♡ 170
- 🤗 Jackrong/Qwen3.5-9B-GLM5.1-Distill-v1-GGUF · model · 191k dl · ♡ 101
- 🤗 Jackrong/Qwopus3.5-9B-v3.5 · model · 2.3k dl · ♡ 23
- 🤗 Jackrong/Qwopus-GLM-18B-Merged-GGUF · model · 59k dl · ♡ 251
- 🤗 Jackrong/Qwen3.5-9B-DeepSeek-V4-Flash · model · 5.5k dl · ♡ 23
- 🤗 Jackrong/Gemopus-4-31B-it · model · 126 dl · ♡ 9
- 🤗 Jackrong/Qwen3.5-27B-GLM5.1-Distill-v1-GGUF · model · 3.6k dl · ♡ 16
- 🤗 rico03/Qwopus-GLM-9B-DualReason-Distilled-GGUF · model · 683 dl · ♡ 2
- 🤗 jasonrqh/Qwen3-1.7B_Math-CoT-20k_lr5e-5_ep8_bs256 · model
- 🤗 jasonrqh/Qwen3-1.7B_Countdown-CoT-20k_lr5e-5_ep8_bs256 · model
- jasonrqh/Countdown-CoT-20k · dataset · 102 dl
- jasonrqh/DeepSeek-R1-20k · dataset · 580 dl
- jasonrqh/Math-CoT-20k · dataset · 204 dl
- jasonrqh/Math-NoCoT-20k · dataset · 87 dl
- jasonrqh/NuminaMath-20k · dataset · 40 dl
- jasonrqh/Math-CoT-44k-Qwen3-32b-n32-16384-with-logprob-and-entropy · dataset · 6.9k dl
