TL;DR
AGPO is an adaptive, critic-free reinforcement learning method that adjusts the clipping range and decoding temperature during training from group-level statistics, improving large language model reasoning on nine English and Chinese math/STEM benchmarks.
Contribution
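The paper presents Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that replaces fixed clipping and a fixed decoding temperature with two controllers driven by a shared statistical state. The aim is to make training less brittle and less tuning-heavy while outperforming PPO/GRPO under the same generated-token budget.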
Findings
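Qwen2.5-14B trained with AGPO outperforms PPO/GRPO on nine benchmarks, including GSM8K and MATH, under the same generated-token budget.
Gains transfer to other models such as Llama-3-8B and Gemma-2-9B.
Ablation studies confirm that both modules, adaptive clipping and adaptive temperature sampling, contribute to the result.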
Abstract
Reinforcement learning improves LLM reasoning, but PPO/GRPO typically use fixed clipping and decoding temperature, which makes training brittle and tuning-heavy. We propose Adaptive Group Policy Optimization (AGPO), a critic-free refinement of GRPO that uses group-level statistics to control both update magnitude and exploration. AGPO uses a shared probe-derived statistical state to drive two controllers: (i) adaptive clipping, which sets the trust-region size from reward dispersion and skewness, probe vote entropy, policy entropy, and step-wise KL drift; and (ii) bidirectional adaptive temperature sampling, which heats or cools decoding around a base temperature according to centered uncertainty relative to a running baseline. On nine English and Chinese math/STEM benchmarks, Qwen2.5-14B trained with AGPO outperforms PPO/GRPO under the same generated-token budget, reaching 67.3% on…
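The abstract describes the temperature controller only in words, so the following is a minimal sketch of one way it could look, not the paper's formulation. It computes group-level statistics of the kind the abstract lists (reward dispersion and skewness, vote entropy) and uses a centered uncertainty signal to heat or cool decoding around a base temperature. The exponential mapping, the EMA baseline, and every name and default value (`TemperatureController`, `gain`, `momentum`, `t_min`, `t_max`) are assumptions made for illustration.

```python
import math
from collections import Counter


def group_stats(rewards):
    """Mean, population std, and skewness of one group's rewards."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0:
        return mean, 0.0, 0.0
    skew = sum((r - mean) ** 3 for r in rewards) / (n * std ** 3)
    return mean, std, skew


def vote_entropy(answers):
    """Entropy of the answer votes in a group, normalised to [0, 1]."""
    n = len(answers)
    if n <= 1:
        return 0.0
    counts = Counter(answers)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return abs(h) / math.log(n)


class TemperatureController:
    """Hypothetical bidirectional controller (assumed form, not from the paper).

    Uncertainty above the running baseline raises the temperature,
    uncertainty below it lowers the temperature.
    """

    def __init__(self, base_temp=1.0, gain=0.5, momentum=0.9,
                 t_min=0.5, t_max=1.5):
        self.base_temp = base_temp
        self.gain = gain
        self.momentum = momentum
        self.t_min = t_min
        self.t_max = t_max
        self.baseline = None  # running (EMA) uncertainty baseline

    def step(self, uncertainty):
        if self.baseline is None:
            self.baseline = uncertainty
        centered = uncertainty - self.baseline
        # Assumed mapping: multiplicative, symmetric in log-temperature.
        temp = self.base_temp * math.exp(self.gain * centered)
        temp = min(max(temp, self.t_min), self.t_max)
        # Update the baseline after using it, so the signal stays centered.
        self.baseline = (self.momentum * self.baseline
                         + (1.0 - self.momentum) * uncertainty)
        return temp


if __name__ == "__main__":
    ctrl = TemperatureController()
    for answers in (["4", "4", "5", "4"], ["1", "2", "3", "4"], ["7", "7", "7", "7"]):
        u = vote_entropy(answers)
        print(f"uncertainty={u:.2f} temperature={ctrl.step(u):.3f}")
```

In this sketch the first call returns the base temperature because the baseline is initialised to the first observation; later calls move above or below it depending on whether the group is more or less uncertain than recent history.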
