G$^2$RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance
Yongxin Guo, Wenbo Deng, Zhenglin Cheng, Xiaoying Tang

TL;DR
G$^2$RPO-A is an adaptive reinforcement learning method that injects ground-truth reasoning steps into roll-out trajectories and adjusts the guidance strength as training progresses, improving small language models' reasoning and code-generation performance over vanilla GRPO.
Contribution
It introduces G$^2$RPO-A, an adaptive guidance algorithm that automatically tunes guidance strength based on training dynamics, enhancing small language models' performance.
Findings
G$^2$RPO-A outperforms vanilla GRPO on reasoning and code benchmarks.
Adaptive guidance improves training efficiency and model accuracy.
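Naively injecting ground-truth reasoning steps yields only limited gains for small models; the benefit comes from adapting guidance strength to the model's training dynamics.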
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has markedly enhanced the reasoning abilities of large language models (LLMs). Its success, however, largely depends on strong base models with rich world knowledge, yielding only modest improvements for small-size language models (SLMs). To address this limitation, we investigate Guided GRPO, which injects ground-truth reasoning steps into roll-out trajectories to compensate for SLMs' inherent weaknesses. Through a comprehensive study of various guidance configurations, we find that naively adding guidance delivers limited gains. These insights motivate G$^2$RPO-A, an adaptive algorithm that automatically adjusts guidance strength in response to the model's evolving training dynamics. Experiments on mathematical reasoning and code-generation benchmarks confirm that G$^2$RPO-A substantially outperforms vanilla GRPO. Our code and…
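The abstract describes two ingredients: ground-truth reasoning steps injected into roll-outs, and a guidance strength that changes with training dynamics. The following is a minimal sketch of how these pieces could fit together, not the authors' implementation. The group-standardized advantage follows the usual GRPO formulation; the function names, the `guidance_ratio` parameter, and especially the threshold rule in `adapt_guidance_ratio` are hypothetical choices made for illustration, since the abstract does not specify the adaptation rule.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its roll-out group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def build_guided_prompt(question, reference_steps, guidance_ratio):
    """Prepend the first fraction of ground-truth steps to the roll-out prompt.

    `guidance_ratio` in [0, 1] is an assumed knob for guidance strength.
    """
    k = round(guidance_ratio * len(reference_steps))
    return "\n".join([question, *reference_steps[:k]])


def adapt_guidance_ratio(ratio, recent_mean_reward, target=0.5, step=0.1):
    """Hypothetical rule (not from the paper): give more guidance when recent
    rewards fall below a target, less when they exceed it, clipped to [0, 1]."""
    if recent_mean_reward < target:
        ratio += step
    elif recent_mean_reward > target:
        ratio -= step
    return min(1.0, max(0.0, ratio))


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 0.0]  # verifiable rewards for one roll-out group
    print(group_relative_advantages(rewards))
    print(build_guided_prompt("Q: 2 + 3 * 4 = ?", ["3 * 4 = 12", "2 + 12 = 14"], 0.5))
    print(adapt_guidance_ratio(0.5, mean(rewards)))
```

In this sketch, a group whose roll-outs all receive the same reward gets zero advantage everywhere, which is one intuition for why a weak model that rarely succeeds learns little without guidance.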
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Simulation Techniques and Applications
