On-Policy Supervised Fine-Tuning for Efficient Reasoning
Anhao Zhao, Ziyang Chen, Junlong Tong, Yingqi Fan, Fanghua Ye, Shuhao Li, Yunpu Ma, Wenjie Li, Xiaoyu Shen

TL;DR
This paper introduces a simple on-policy supervised fine-tuning (SFT) method for large reasoning models: the model is fine-tuned on its own generations, selected for correctness and conciseness. The method shortens reasoning while maintaining accuracy, and it outperforms more complex reinforcement learning approaches.
Contribution
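The paper shows that removing KL regularization and group-wise normalization, and simplifying the reward to a truncation-based length penalty, reduces the objective to on-policy supervised fine-tuning. The resulting method is more efficient and achieves comparable or superior reasoning performance.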
Findings
Reduces chain-of-thought length by up to 80% while maintaining accuracy.
Cuts GPU memory usage by 50% and speeds up training convergence by 70%.
Outperforms complex RL-based methods across five benchmarks.
Abstract
Large reasoning models (LRMs) are commonly trained with reinforcement learning (RL) to explore long chain-of-thought reasoning, achieving strong performance at high computational cost. Recent methods add multi-reward objectives to jointly optimize correctness and brevity, but these complex extensions often destabilize training and yield suboptimal trade-offs. We revisit this objective and challenge the necessity of such complexity. Through principled analysis, we identify fundamental misalignments in this paradigm: KL regularization loses its intended role when correctness and length are directly verifiable, and group-wise normalization becomes ambiguous under multiple reward signals. By removing these two items and simplifying the reward to a truncation-based length penalty, we show that the optimization problem reduces to supervised fine-tuning on self-generated data filtered for both…
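The data-selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Sample` record, the `filter_self_generated` helper, exact-match answer checking, and the `max_tokens` budget standing in for the truncation-based length penalty are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One response sampled from the current policy (hypothetical record)."""
    prompt: str
    response: str
    answer: str       # final answer extracted from the response
    num_tokens: int   # length of the response in tokens


def filter_self_generated(samples, references, max_tokens):
    """Keep only responses that are both correct and within the length budget.

    `references` maps each prompt to its verified answer. Exact string match
    is an assumed stand-in for whatever verifier is actually used.
    """
    kept = []
    for s in samples:
        is_correct = s.answer.strip() == references[s.prompt].strip()
        is_concise = s.num_tokens <= max_tokens  # assumed truncation budget
        if is_correct and is_concise:
            # Surviving pairs become ordinary SFT training examples.
            kept.append({"prompt": s.prompt, "completion": s.response})
    return kept


if __name__ == "__main__":
    refs = {"2+2?": "4"}
    sampled = [
        Sample("2+2?", "2+2=4. Answer: 4", "4", 9),
        Sample("2+2?", "A very long derivation... Answer: 4", "4", 900),
        Sample("2+2?", "2+2=5. Answer: 5", "5", 9),
    ]
    print(filter_self_generated(sampled, refs, max_tokens=100))
```

Under these assumptions, only the first sampled response survives: the second is correct but exceeds the budget, and the third is short but wrong. Because the samples come from the model being trained, fine-tuning on the kept pairs is on-policy in the sense the abstract describes.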
Taxonomy
Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI)
