Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

Qihan Ren; Peng Wang; Ruikun Cai; Shuai Shao; Dadi Guo; Yuejin Xie; Yafu Li; Quanshi Zhang; Xia Hu; Jing Shao; Dongrui Liu

arXiv:2604.06628·cs.AI·April 9, 2026


1 Repo 50 Models 6 Datasets

TL;DR

This paper reevaluates the generalization capabilities of reasoning supervised fine-tuning in large language models, emphasizing the roles of optimization, data quality, and model strength in cross-domain reasoning performance.

Contribution

It challenges the simplistic view that SFT memorizes and RL generalizes, showing that generalization depends on multiple factors including training dynamics, data, and model capability.

Findings

01. Cross-domain generalization is conditional: it depends on optimization dynamics, training data, and base-model capability.

02. Extended training can reverse an initial dip in cross-domain performance and then improve on the starting point.

03. Stronger base models transfer reasoning skills across domains more effectively.

Abstract

A prevailing narrative in LLM post-training holds that supervised finetuning (SFT) memorizes while reinforcement learning (RL) generalizes. We revisit this claim for reasoning SFT with long chain-of-thought (CoT) supervision and find that cross-domain generalization is not absent but conditional, jointly shaped by optimization dynamics, training data, and base-model capability. Some reported failures are under-optimization artifacts: cross-domain performance first degrades before recovering and improving with extended training (a dip-and-recovery pattern), so short-training checkpoints can underestimate generalization. Data quality and structure both matter: low-quality solutions broadly hurt generalization, while verified long-CoT traces yield consistent cross-domain gains. Model capability is essential: stronger models internalize transferable procedural patterns (e.g., backtracking)…
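The dip-and-recovery pattern described in the abstract can be checked on any series of checkpoint evaluations. The following is a minimal sketch, not the authors' evaluation code: the function name `find_dip_and_recovery`, the `DipRecovery` record, and the example scores are all hypothetical, and the scores are made-up numbers chosen only to illustrate the shape of the curve. It assumes the first entry is the base model before SFT and that higher scores are better.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DipRecovery:
    dip_step: int                  # training step with the lowest score
    dip_depth: float               # baseline score minus the lowest score
    recovery_step: Optional[int]   # first later step at or above baseline, if any
    final_gain: float              # last score minus baseline score


def find_dip_and_recovery(
    steps: Sequence[int], scores: Sequence[float]
) -> Optional[DipRecovery]:
    """Locate a dip below the baseline and the first checkpoint that recovers.

    Hypothetical helper: scores[0] is treated as the baseline (the model
    before SFT). Returns None if the score never drops below the baseline.
    """
    if len(steps) != len(scores) or len(scores) < 2:
        raise ValueError("need at least two (step, score) pairs of equal length")

    baseline = scores[0]
    # Index of the lowest score; ties resolve to the earliest checkpoint.
    low_idx = min(range(len(scores)), key=scores.__getitem__)
    if scores[low_idx] >= baseline:
        return None  # no dip below the baseline

    # Scan checkpoints after the dip for the first one back at baseline.
    recovery_step = None
    for i in range(low_idx + 1, len(scores)):
        if scores[i] >= baseline:
            recovery_step = steps[i]
            break

    return DipRecovery(
        dip_step=steps[low_idx],
        dip_depth=baseline - scores[low_idx],
        recovery_step=recovery_step,
        final_gain=scores[-1] - baseline,
    )


if __name__ == "__main__":
    # Made-up cross-domain accuracies, for illustration only.
    example_steps = [0, 100, 200, 400, 800]
    example_scores = [0.40, 0.31, 0.35, 0.42, 0.47]
    print(find_dip_and_recovery(example_steps, example_scores))
```

If evaluation stops at an early checkpoint, `recovery_step` is `None` and `final_gain` is negative, which is the situation the abstract warns about: a short-training checkpoint looks like a generalization failure even though later checkpoints might recover.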


Code & Models

Repositories

nebularaid2000/rethink_sft_generalization (GitHub)

