GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning
Ziru Liu, Cheng Gong, Xinyu Fu, Yaofang Liu, Ran Chen, Shoubo Hu, Suiyun Zhang, Rui Liu, Qingfu Zhang, Dandan Tu

TL;DR
GHPO introduces an adaptive, difficulty-aware reinforcement learning framework that improves training stability and performance of large language models in complex reasoning tasks by dynamically balancing imitation and exploration strategies.
Contribution
The paper presents GHPO, a novel adaptive guidance method that calibrates task difficulty for stable and efficient LLM reinforcement learning, outperforming existing methods.
Findings
Achieves ~5% performance improvement on six math benchmarks
Enhances training stability and reasoning accuracy
Balances imitation and exploration for curriculum learning
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of large language models (LLMs), particularly in the domain of complex reasoning tasks. However, prevailing on-policy RL methods often contend with significant training instability and inefficiency. This is primarily due to a capacity-difficulty mismatch, where the complexity of training data frequently outpaces the model's current capabilities, leading to critically sparse reward signals and stalled learning progress. This challenge is particularly acute for smaller, more resource-efficient LLMs. To overcome this, we introduce the Guided Hybrid Policy Optimization (GHPO), a novel difficulty-aware reinforcement learning framework. GHPO dynamically calibrates task difficulty by employing adaptive prompt refinement to provide targeted guidance. This…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. Directly tackles reward sparsity with a simple mechanism (difficulty detection + adaptive hints) that keeps training on-policy and lightweight.
2. Empirical gains and stability. The paper reports consistent improvements (+5% across six math benchmarks) and shows training-dynamics curves where GHPO achieves higher accuracy reward, longer reasoning traces, and smaller gradient norms than GRPO.
1. Brittle difficulty detector. The “all-zero” rule (difficult iff all G rewards are 0) is too strict; with G=8, a single lucky success disables guidance. ~60% of problems are still flagged as difficult during training, suggesting heavy hint reliance under a rigid threshold.
2. Unvalidated cold start. Detection is disabled for the first 20 steps (pure GRPO), which is reasonable but ad hoc; there is no ablation of N.
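The detection rule and cold start described in this review can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `is_difficult`, the zero-based `step` counter, and the `warmup_steps` parameter are assumptions. Only the all-zero rule, G=8, and the 20-step warm-up come from the review text.

```python
from typing import Sequence


def is_difficult(rewards: Sequence[float], step: int, warmup_steps: int = 20) -> bool:
    """Flag a prompt as difficult iff every sampled response earned zero reward.

    `step` is assumed to be a zero-based training-step counter (hypothetical).
    During the first `warmup_steps` steps detection is off, so training
    behaves like plain GRPO.
    """
    if step < warmup_steps:
        return False
    return all(r == 0 for r in rewards)


# G = 8 sampled responses for one prompt.
all_failed = [0.0] * 8
one_lucky = [0.0] * 7 + [1.0]

flag_all_failed = is_difficult(all_failed, step=100)   # guidance would be triggered
flag_one_lucky = is_difficult(one_lucky, step=100)     # one success disables guidance
flag_in_warmup = is_difficult(all_failed, step=5)      # detection disabled in warm-up
```

The second call shows the brittleness the reviewer points to: a single nonzero reward out of eight is enough to switch guidance off for that prompt.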
**Novel approach:** The method combines imitation learning and reinforcement learning in a dynamic way and introduces a multi-stage hint mechanism to improve training stability and efficiency.
**Comprehensive experimentation:** The authors evaluate the approach on two types of training datasets (medium and hard), two base models (Qwen2.5-Base-7B and Qwen2.5-Math-7B), and several benchmarks (Math_500, OlympiadBench, AIME). GHPO shows consistent improvements over GRPO and GRPO+CL.
1. The citation and reference formatting do not follow ICLR standards. Several entries lack venue, volume, or page information, and some published works (e.g., DeepSeek-R1, OlympiadBench) are not properly cited, which affects the paper’s professionalism.
2. The paper does not provide any sensitivity or ablation study for key hyperparameters such as the hint ratio (ω) or the number of training stages, making it unclear how robust the method is to these design choices.
3. There is insufficient dis
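To make the role of the hint ratio ω concrete, here is one plausible way a hint could be cut from a reference solution and attached to a prompt. Everything in this sketch is an assumption for illustration: the helper names `make_hint` and `guided_prompt`, the word-level prefix truncation, the `Hint:` template, and the example ratios are hypothetical and are not taken from the paper.

```python
def make_hint(solution: str, omega: float) -> str:
    """Return roughly the first `omega` fraction of the solution, by word count.

    Assumption: the hint is a prefix of the ground-truth solution; the
    paper may tokenize or segment the solution differently.
    """
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    words = solution.split()
    keep = int(len(words) * omega)
    return " ".join(words[:keep])


def guided_prompt(question: str, solution: str, omega: float) -> str:
    """Append a partial-solution hint to the question (hypothetical template)."""
    hint = make_hint(solution, omega)
    if not hint:
        return question  # no hint available: fall back to the plain prompt
    return f"{question}\nHint: {hint}"


question = "What is 12 * 13?"
solution = "First compute 12 * 10 = 120. Then add 12 * 3 = 36. Total 156."

# Hypothetical ratios, tried in increasing order in a multi-stage scheme.
prompts = [guided_prompt(question, solution, w) for w in (0.25, 0.5, 0.75)]
```

A sensitivity study of the kind the reviewer asks for would vary the ratios and the number of stages and report how final accuracy changes.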
- The paper clearly identifies a critical and practical failure mode in on-policy RLVR (Section 2.3). It correctly points out that when all $G$ responses in a group receive a zero reward, both the mean and standard deviation of the rewards become zero, which nullifies the advantage signal and provides no gradient for the update. GHPO's mechanism of using this "all-zero" state as an online difficulty detector is pragmatic and intuitive.
- The paper is well-written, and the method is explained cle
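The failure mode described in the first point can be reproduced numerically. The sketch below computes group-normalized advantages in the GRPO style, $A_i = (r_i - \mathrm{mean}(r)) / (\mathrm{std}(r) + \epsilon)$. The function name, the use of the population standard deviation, and the small constant `eps` in the denominator are assumptions; implementations differ on these details.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of G sampled responses.

    Assumptions: population standard deviation, and `eps` added to the
    denominator to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# All G = 8 responses fail: mean and std are both zero, so every
# advantage is exactly zero and the group contributes no gradient.
all_zero = group_advantages([0.0] * 8)

# One success out of eight: the successful response gets a positive
# advantage and the failures get small negative ones.
one_hit = group_advantages([1.0] + [0.0] * 7)
```

With all rewards equal to zero, every numerator is zero regardless of `eps`, which is why such a group carries no learning signal.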
- Weak Baseline Comparisons: The paper's empirical evaluation is limited by comparing GHPO almost exclusively to variants of GRPO (standard GRPO and GRPO-CL). Without comparisons to these state-of-the-art methods (LUFFY, UFT, ...), it is impossible to situate GHPO's performance and determine if its gains are a significant advancement over the field or merely an improvement on a specific, and possibly simple, baseline.
- The core idea of GHPO is to "adaptively balance direct imitation learning...
1. The idea of using adaptive hints to improve RL stability and efficiency is interesting and intuitive.
2. The proposed GHPO achieves better performance than the GRPO/GRPO-CL baselines.
1. The paper criticizes DAPO for discarding a significant portion of training data. But DAPO can potentially reuse those discarded training data at later stages when the model improves. So the criticism may not be entirely fair.
2. The experiments do not include an ablation study for DAPO, which limits the completeness of the comparison.
3. The proposed method may not generalize to datasets where the ground-truth solution lacks a chain-of-thought (CoT) component. For example, in multiple-choice
Taxonomy
Topics: Iterative Learning Control Systems · Reinforcement Learning in Robotics · Elevator Systems and Control
