A Practical Two-Stage Recipe for Mathematical LLMs: Maximizing Accuracy with SFT and Efficiency with Reinforcement Learning
Hiroshi Yoshihara, Taiki Yamaguchi, Yuichi Inoue

TL;DR
This paper presents a practical two-stage training approach combining extended Supervised Fine-Tuning and Reinforcement Learning to significantly improve the accuracy and efficiency of mathematical reasoning in Large Language Models.
Contribution
It introduces a systematic methodology that effectively integrates SFT and RL, demonstrating substantial performance gains and efficiency improvements in mathematical reasoning tasks.
Findings
Extending SFT for as many as 10 epochs is crucial for performance breakthroughs.
GRPO primarily reduces solution length while maintaining accuracy.
The recipe achieves top-tier results on the AIMO benchmark.
Abstract
Enhancing the mathematical reasoning of Large Language Models (LLMs) is a pivotal challenge in advancing AI capabilities. While Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are the dominant training paradigms, a systematic methodology for combining them to maximize both accuracy and efficiency remains largely unexplored. This paper introduces a practical and effective training recipe that strategically integrates extended SFT with RL from online inference (GRPO). We posit that these methods play complementary, not competing, roles: a prolonged SFT phase first pushes the model's accuracy to its limits, after which a GRPO phase dramatically improves token efficiency while preserving this peak performance. Our experiments reveal that extending SFT for as many as 10 epochs is crucial for performance breakthroughs, and that the primary role of GRPO in this framework is to…
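To make the GRPO stage more concrete, the sketch below shows the group-relative advantage computation that gives GRPO its name: several solutions are sampled for the same problem, each is scored, and each score is normalized against the group's mean and standard deviation. This is a minimal illustration, not the authors' implementation. The function name `group_relative_advantages`, the `eps` constant, and the binary correctness rewards in the usage lines are assumptions made for the example; the reward design used in the paper is not stated in the text above.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-sample rewards within one group of sampled solutions.

    `rewards` holds one scalar per sampled solution to the same problem.
    Each advantage is (reward - group mean) / (group std + eps), so
    above-average solutions get positive advantages and below-average
    ones get negative advantages.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled solutions with binary correctness rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# When every sample receives the same reward, all advantages are zero,
# so that group contributes no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because the advantages are relative within a group, no separate value model is needed to provide a baseline.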
Taxonomy
Topics: Machine Learning in Materials Science · Mathematics, Computing, and Information Processing · Topic Modeling
Methods: Shrink and Fine-Tune
