Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning

Bowen Ding; Yuhan Chen; Jiayang Lyv; Jiyao Yuan; Qi Zhu; Shuangshuang Tian; Dantong Zhu; Futing Wang; Heyuan Deng; Fei Mi; Lifeng Shang; Tao Lin

arXiv:2512.11470·cs.LG·May 12, 2026

TL;DR

This paper introduces the Plasticity-Ceiling Framework to optimize expert trajectory utilization in LLM post-training for mathematical reasoning, emphasizing sequential SFT-then-RL and data-driven scaling guidelines.

Contribution

It proposes a new framework for understanding and improving post-training performance, establishing the superiority of sequential SFT-then-RL and providing practical scaling and trajectory selection guidelines.

Findings

01. Sequential SFT-then-RL outperforms synchronized approaches, which suffer from instability and premature convergence.

02. Transitioning to RL at the Stable or Mild Overfitting regime of SFT maximizes the final performance ceiling.

03. Data scale is the primary factor influencing post-training potential.

Abstract

Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) dominate the post-training landscape for mathematical reasoning, yet differ fundamentally in their reliance on expert trajectories. To understand the optimal way to harness these trajectories for maximizing performance, we propose the Plasticity-Ceiling Framework. This framework empirically grounds the post-training landscape by decomposing the final performance ceiling into the foundational SFT performance and the subsequent RL plasticity (i.e., the maximum improvement via RL). Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability and premature convergence deficits inherent in synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) Transitioning to RL at the Stable or Mild Overfitting Regime of SFT maximizes the final ceiling…
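
The decomposition described in the abstract (final ceiling = SFT performance + RL plasticity) can be illustrated with a small sketch. This is not the authors' code: the class name, field names, and all numbers below are hypothetical and chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PostTrainingRun:
    """Hypothetical record of one SFT-then-RL run (not from the paper's repo)."""

    sft_performance: float  # benchmark score after SFT, before any RL
    final_ceiling: float    # best score reached after the subsequent RL stage

    @property
    def rl_plasticity(self) -> float:
        # Plasticity as the abstract describes it: the maximum improvement via RL,
        # so that final_ceiling = sft_performance + rl_plasticity.
        return self.final_ceiling - self.sft_performance


# Made-up scores for two hypothetical SFT checkpoints at which RL begins.
runs = {
    "early_switch": PostTrainingRun(sft_performance=40.0, final_ceiling=55.0),
    "late_switch": PostTrainingRun(sft_performance=50.0, final_ceiling=58.0),
}

# Select the run with the highest final ceiling, not the highest plasticity.
best = max(runs, key=lambda name: runs[name].final_ceiling)

for name, run in runs.items():
    print(name, run.sft_performance, run.rl_plasticity, run.final_ceiling)
print("highest ceiling:", best)
```

In this toy setup the run with the larger RL plasticity does not reach the higher ceiling, which is why the two terms are worth tracking separately when comparing switch points.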

Code & Models

Repository: lins-lab/RETU (GitHub)
