TL;DR
This paper introduces the Plasticity-Ceiling Framework to optimize expert trajectory utilization in LLM post-training for mathematical reasoning, emphasizing sequential SFT-then-RL and data-driven scaling guidelines.
Contribution
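It proposes the Plasticity-Ceiling Framework, which decomposes the final post-training performance ceiling into the foundational SFT performance and the subsequent RL plasticity. Using this framework, it establishes the sequential SFT-then-RL pipeline as superior to synchronized approaches and provides practical scaling and trajectory selection guidelines.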
Findings
Sequential SFT-then-RL outperforms synchronized approaches, which suffer from instability and premature convergence.
Transitioning to RL at the Stable or Mild Overfitting Regime of SFT maximizes the final performance ceiling.
Data scale is the primary factor influencing post-training potential.
Abstract
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) dominate the post-training landscape for mathematical reasoning, yet differ fundamentally in their reliance on expert trajectories. To understand the optimal way to harness these trajectories for maximizing performance, we propose the Plasticity-Ceiling Framework. This framework empirically grounds the post-training landscape by decomposing the final performance ceiling into the foundational SFT performance and the subsequent RL plasticity (i.e., the maximum improvement via RL). Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability and premature convergence deficits inherent in synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the Stable or Mild Overfitting Regime of SFT maximizes the final ceiling…
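The decomposition described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the class name `PostTrainingRun`, its field names, and the accuracy values are all hypothetical, and the only thing taken from the abstract is the relationship itself (final ceiling = SFT performance + RL plasticity, where plasticity is the maximum improvement via RL).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostTrainingRun:
    """Hypothetical record of one SFT-then-RL run (names are illustrative)."""

    sft_performance: float    # benchmark score at the SFT checkpoint where RL starts
    best_rl_performance: float  # best benchmark score reached during the RL stage

    @property
    def rl_plasticity(self) -> float:
        # Plasticity: the maximum improvement RL adds on top of the SFT starting point.
        return self.best_rl_performance - self.sft_performance

    @property
    def ceiling(self) -> float:
        # Final ceiling, written as the sum of its two components.
        return self.sft_performance + self.rl_plasticity


# Made-up numbers for illustration only; they are not results from the paper.
run = PostTrainingRun(sft_performance=0.50, best_rl_performance=0.75)
print(run.rl_plasticity, run.ceiling)
```

Writing the ceiling as a sum makes the trade-off explicit: a later SFT checkpoint may raise `sft_performance` while shrinking `rl_plasticity`, so the two terms have to be compared together when choosing where to switch to RL.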
