G$^2$RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance

Yongxin Guo; Wenbo Deng; Zhenglin Cheng; Xiaoying Tang

arXiv:2508.13023 · cs.AI · August 19, 2025

TL;DR

G$^2$RPO-A is a reinforcement learning method that injects ground-truth reasoning steps into roll-out trajectories and adapts the guidance strength to the model's training dynamics. It outperforms vanilla GRPO for small language models on mathematical reasoning and code-generation benchmarks.

Contribution

The paper introduces G$^2$RPO-A, an adaptive guidance algorithm that automatically tunes guidance strength based on training dynamics, improving the performance of small language models.

Findings

01 G$^2$RPO-A outperforms vanilla GRPO on reasoning and code benchmarks.

02 Adaptive guidance improves training efficiency and model accuracy.

03 Ground-truth reasoning injection benefits small models significantly.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has markedly enhanced the reasoning abilities of large language models (LLMs). Its success, however, largely depends on strong base models with rich world knowledge, yielding only modest improvements for small-size language models (SLMs). To address this limitation, we investigate Guided GRPO, which injects ground-truth reasoning steps into roll-out trajectories to compensate for SLMs' inherent weaknesses. Through a comprehensive study of various guidance configurations, we find that naively adding guidance delivers limited gains. These insights motivate G$^2$RPO-A, an adaptive algorithm that automatically adjusts guidance strength in response to the model's evolving training dynamics. Experiments on mathematical reasoning and code-generation benchmarks confirm that G$^2$RPO-A substantially outperforms vanilla GRPO. Our code and…
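To make the idea in the abstract concrete, the following is a minimal sketch of how guidance could be injected and adapted. It is not the authors' implementation. The group-standardized advantage is the standard GRPO form. The helper names (`build_guided_prompt`, `adapt_guidance_len`), the choice to measure guidance strength as a number of ground-truth steps, and the reward thresholds are all hypothetical and chosen only for illustration; the abstract does not specify the actual adaptation rule.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its roll-out group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def build_guided_prompt(question, reference_steps, guidance_len):
    """Append the first `guidance_len` ground-truth steps to the question.

    Hypothetical helper: the model would continue the roll-out from this prefix.
    """
    k = max(0, min(guidance_len, len(reference_steps)))
    if k == 0:
        return question
    return question + "\n" + "\n".join(reference_steps[:k])


def adapt_guidance_len(guidance_len, mean_reward, max_len, low=0.3, high=0.7):
    """Assumed adaptation rule (not from the paper): give more guidance when
    the group's mean reward is low, less when it is high."""
    if mean_reward < low:
        return min(max_len, guidance_len + 1)
    if mean_reward > high:
        return max(0, guidance_len - 1)
    return guidance_len


if __name__ == "__main__":
    steps = ["Step 1: expand the product.", "Step 2: collect terms.", "Step 3: solve for x."]
    k = 1
    print(build_guided_prompt("Solve (x+1)(x-2)=0.", steps, k))

    rewards = [1.0, 0.0, 0.0, 0.0]  # verifiable rewards for one group of roll-outs
    print(group_relative_advantages(rewards))

    # Mean reward 0.25 is below the assumed `low` threshold, so guidance grows.
    k = adapt_guidance_len(k, mean(rewards), max_len=len(steps))
    print(k)
```

In this sketch the guidance length is a single integer updated once per group. A real training loop would also need to decide how guided tokens are treated in the policy-gradient loss, which the abstract does not describe.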

Taxonomy

Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Simulation Techniques and Applications