Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning

Xichen Zhang; Sitong Wu; Yinghao Zhu; Haoru Tan; Shaozuo Yu; Ziyi He; Jiaya Jia

arXiv:2510.19807 · cs.CL · March 3, 2026

TL;DR

Scaf-GRPO is a progressive training framework that enhances LLM reasoning by strategically providing minimal guidance when models plateau, significantly improving performance on challenging benchmarks.

Contribution

Introduces Scaf-GRPO, a novel scaffolded training method that diagnoses learning stagnation and injects tiered hints to improve LLM reasoning on difficult tasks.

Findings

01

Boosted Qwen2.5-Math-7B's pass@1 score on AIME24 by a relative 44.3% over a vanilla GRPO baseline

02

Effectively overcomes the 'learning cliff' in policy optimization

03

Enhances autonomous reasoning in large language models

Abstract

Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the "learning cliff" phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model's independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt…
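
The collapse described in the abstract can be illustrated with a minimal sketch. It assumes the common GRPO formulation in which each rollout's reward is normalized by the mean and standard deviation of its group; the function name `group_relative_advantages`, the group size of 8, the binary rewards, and the `eps` stabilizer are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its group's mean and standard deviation.

    Hypothetical helper: eps is an assumed stabilizer that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed setup: 8 rollouts per prompt with binary verifiable rewards.
all_failed = [0.0] * 8          # the "learning cliff": every rollout fails
mixed = [0.0] * 7 + [1.0]       # a single rollout succeeds

print(group_relative_advantages(all_failed))  # every advantage is zero
print(group_relative_advantages(mixed))       # success is positive, failures negative
```

When every reward in the group is identical, each advantage is zero, so the prompt contributes nothing to the policy gradient. A single success is enough to give the group a non-zero signal.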

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 2

Strengths

1. The paper is well written and well organized.
2. Sufficient experiments are provided to validate the effectiveness of the method.

Weaknesses

see questions

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The strengths of this paper are as follows:
1. The proposed method adapts external knowledge (the solution) into on-policy solutions, which avoids the distribution shift that results from directly using off-policy solutions.
2. The experiments cover several different models, and Scaf-GRPO consistently improves over the baselines. The ablation study is also comprehensive and covers most of the design choices of the proposed method.
3. Overall, the paper is clearly written.

Weaknesses

In general, I do not think this paper has any major weaknesses. Some identified weaknesses and my questions are listed below.
1. In equation (4), the authors propose to use $\pi_{\theta}(\cdot|q\oplus h^{*})/\pi_{\text{old}}(\cdot|q\oplus h^{*})$ as the importance ratio. However, this does not exactly match the probability $\pi_{\theta}(\cdot|q)$. To the reviewer, $\pi_{\theta}(\cdot|q)/\pi_{\text{old}}(\cdot|q\oplus h^{*})$ makes more sense. Could the authors compare these two different appr

Reviewer 03 · Rating 2 · Confidence 3

Strengths

In a learning cliff scenario, all rewards are zero, causing the advantage calculation to collapse and the learning gradient to vanish. Scaf-GRPO intervenes by generating a single successful trajectory on-policy using a minimally effective hint. This successful trajectory replaces a failed one in the batch, which "restores a meaningful advantage signal" and ensures "non-zero reward variance", allowing the standard GRPO update to resume.
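
The replacement step the reviewer describes can be sketched as follows. This is only an illustration under stated assumptions: rewards are binary, the hint-guided rollout has already been generated and scored, and the first failed slot is the one overwritten. The name `scaffold_group` and the choice of slot are hypothetical and are not taken from the paper.

```python
from statistics import pvariance


def scaffold_group(rewards, hinted_reward):
    """Swap one failed rollout for a hint-guided success when the whole group failed.

    Hypothetical helper: which slot gets replaced is an assumption of this sketch.
    """
    group_has_success = any(r > 0 for r in rewards)
    if group_has_success or hinted_reward <= 0:
        # No intervention: the group already has signal, or the hint did not help.
        return list(rewards)
    patched = list(rewards)
    patched[0] = hinted_reward  # overwrite one failed trajectory
    return patched


cliff_group = [0.0] * 8
patched_group = scaffold_group(cliff_group, hinted_reward=1.0)

print(pvariance(cliff_group))    # zero variance: no advantage signal
print(pvariance(patched_group))  # non-zero variance after the swap
```

Groups that already contain a success are returned unchanged, which mirrors the idea that guidance is injected only when unassisted rollouts all fail.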

Weaknesses

- The framework's central claim to preserving the "on-policy principle" is questionable. When all trajectories for a query $q$ fail (the "learning cliff"), the method does not learn to solve $q$. Instead, it introduces a new input, $q \oplus h^{*}$ (query + hint), and learns from this new, simpler task. The policy is indeed 'on-policy' with respect to the augmented prompt, but it has failed and bypassed the original, unhinted task. This is a semantic argument that obscures the fact that the mode

Code & Models

Datasets

hkuzxc/scaf-grpo-dataset
dataset · 94 dl

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications