Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
Xichen Zhang, Sitong Wu, Yinghao Zhu, Haoru Tan, Shaozuo Yu, Ziyi He, Jiaya Jia

TL;DR
Scaf-GRPO is a progressive training framework that enhances LLM reasoning by strategically providing minimal guidance when models plateau, significantly improving performance on challenging benchmarks.
Contribution
Introduces Scaf-GRPO, a novel scaffolded training method that diagnoses learning stagnation and injects tiered hints to improve LLM reasoning on difficult tasks.
Findings
Boosted Qwen2.5-Math-7B's pass@1 score on AIME24 by a relative 44.3% over a vanilla GRPO baseline
Effectively overcomes the 'learning cliff' in policy optimization
Enhances autonomous reasoning in large language models
Abstract
Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the "learning cliff" phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model's independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt…
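The advantage collapse described in the abstract can be written out as a short worked equation. This is a sketch using the commonly cited group-normalized form of the GRPO advantage; the symbols $G$ (group size), $r_i$ (reward of the $i$-th sampled response), and the stabilizer $\epsilon$ are assumptions for illustration and may not match the paper's own notation.

```latex
% Group-normalized advantage for a group of G sampled responses (assumed form)
\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}
                     {\operatorname{std}(r_1,\dots,r_G) + \epsilon},
\qquad i = 1,\dots,G.

% Learning cliff: every response fails, so r_1 = \dots = r_G = 0
\operatorname{mean}(r_1,\dots,r_G) = 0
\;\;\Longrightarrow\;\;
\hat{A}_i \;=\; \frac{0 - 0}{0 + \epsilon} \;=\; 0
\quad \text{for all } i.
```

Because every $\hat{A}_i$ is zero, each term of the policy-gradient objective for that problem is multiplied by zero, which is the sense in which the problem becomes invisible to the learning gradient.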
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper is well written and well organized. 2. Sufficient experiments are provided to validate the effectiveness of their method.
See questions.
The strengths of this paper are as follows: 1. The proposed method adapts external knowledge (the solution) into on-policy solutions, which avoids the distribution shift that results from directly using off-policy solutions. 2. The experiments are conducted on several different models, and Scaf-GRPO consistently improves over the baselines. The ablation study is also comprehensive and covers most of the design choices of the proposed method. 3. Overall, the paper is clearly written.
In general, I think this paper does not demonstrate any major weakness. Some identified weaknesses and my questions are listed below. 1. In equation (4), the authors propose to use $\pi_{\theta}(\cdot|q\oplus h^{*})/\pi_{\text{old}}(\cdot|q\oplus h^{*})$ as the importance ratio. However, this does not exactly match the probability $\pi_{\theta}(\cdot|q)$. To the reviewer, $\pi_{\theta}(\cdot|q)/\pi_{\text{old}}(\cdot|q\oplus h^{*})$ makes more sense. Could the authors compare these two different appr
In a learning cliff scenario, all rewards are zero, causing the advantage calculation to collapse and the learning gradient to vanish. Scaf-GRPO intervenes by generating a single successful trajectory on-policy using a minimally effective hint. This successful trajectory replaces a failed one in the batch, which "restores a meaningful advantage signal" and ensures "non-zero reward variance", allowing the standard GRPO update to resume.
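The mechanism this review summarizes can be illustrated with a toy calculation. The sketch below is not the authors' implementation: it assumes binary rewards, a group of eight rollouts, and the common mean/std group normalization, and the helper names `group_advantages` and `scaffold_group` are hypothetical. It only shows how swapping one failed rollout for a single hinted success turns an all-zero advantage vector into a non-zero one.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps).

    Assumed normalization; eps only guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def scaffold_group(rewards, hinted_reward):
    """Hypothetical helper: if every rollout failed, replace one failed
    rollout's reward with the reward of a hint-guided rollout."""
    if any(r > 0 for r in rewards):
        return list(rewards)  # group already has signal, leave it alone
    patched = list(rewards)
    patched[0] = hinted_reward  # which failed slot is replaced is arbitrary here
    return patched


# Learning cliff: all eight rollouts fail, so every advantage is zero.
cliff = [0.0] * 8
print(group_advantages(cliff))

# One hinted success restores reward variance and non-zero advantages.
patched = scaffold_group(cliff, hinted_reward=1.0)
print(group_advantages(patched))
```

In the patched group, the hinted success receives a positive advantage and the remaining failures receive small negative ones, so the standard update has something to push toward and away from.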
- The framework's central claim to preserving the "on-policy principle" is questionable. When all trajectories for a query $q$ fail (the "learning cliff"), the method does not learn to solve $q$. Instead, it introduces a new input, $q \oplus h^{*}$ (query + hint), and learns from this new, simpler task. The policy is indeed 'on-policy' with respect to the augmented prompt, but it has failed and bypassed the original, unhinted task. This is a semantic argument that obscures the fact that the mode
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
