GAGPO: Generalized Advantage Grouped Policy Optimization

Siyuan Zhu; Chao Yu; Rongxin Yang; Zongkai Liu; Jinjun Hu; Qiwen Chen; Yibo Zhang

arXiv:2605.13217 · cs.CL · May 14, 2026

TL;DR

GAGPO is a critic-free reinforcement learning method that improves credit assignment in multi-turn environments by constructing non-parametric value proxies from sampled rollouts. It outperforms strong RL baselines on ALFWorld and WebShop and learns faster in the early stages of training.

Contribution

GAGPO introduces a novel, critic-free approach for precise temporal credit assignment in multi-turn RL, using grouped value proxies and advantage normalization.

Findings

01 GAGPO outperforms strong RL baselines on ALFWorld and WebShop.

02 It achieves faster early-stage learning and improved interaction efficiency.

03 GAGPO demonstrates smoother optimization dynamics.

Abstract

Reinforcement learning has become a powerful paradigm for post-training large language model agents, yet credit assignment in multi-turn environments remains a challenge. Agents often receive sparse, trajectory-level rewards only at the end of an episode, making it difficult to determine which intermediate actions contributed to success or failure. As a result, propagating delayed outcomes back to individual decision steps without relying on costly auxiliary value models remains an open problem. We propose Generalized Advantage Grouped Policy Optimization (GAGPO), a critic-free reinforcement learning method for precise, step-aligned temporal credit assignment. GAGPO constructs a non-parametric grouped value proxy from sampled rollouts and uses it to compute TD/GAE-style temporal advantages, recursively propagating outcome supervision backward through time. Combined with group-wise…
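
The abstract's core idea, a value proxy built from a group of rollouts and then fed into a TD/GAE-style recursion, can be illustrated with a small sketch. This is not the authors' implementation. It assumes (hypothetically) that the proxy at step t is the mean final reward of the rollouts in the group that are still running at step t, that the only reward arrives on the last step of each rollout, and that the value after the last step is zero. The function names, the per-step-index grouping, and the defaults for gamma and lam are all illustrative choices, and the group-wise processing mentioned in the truncated part of the abstract is not modeled.

```python
from typing import List


def grouped_value_proxy(final_rewards: List[float], lengths: List[int]) -> List[float]:
    """Hypothetical proxy: V(t) is the mean final reward of rollouts that reach step t."""
    proxy = []
    for t in range(max(lengths)):
        # Rollouts with length > t are still running at step t.
        alive = [r for r, n in zip(final_rewards, lengths) if n > t]
        proxy.append(sum(alive) / len(alive))
    return proxy


def gae_advantages(final_reward: float, length: int, proxy: List[float],
                   gamma: float = 1.0, lam: float = 1.0) -> List[float]:
    """Standard GAE recursion, with the grouped proxy standing in for a learned critic."""
    advantages = [0.0] * length
    running = 0.0
    for t in reversed(range(length)):
        # Assumed sparse reward: nonzero only on the last step of the rollout.
        reward = final_reward if t == length - 1 else 0.0
        # Assumed bootstrap value of zero once the rollout has ended.
        next_value = proxy[t + 1] if t + 1 < length else 0.0
        delta = reward + gamma * next_value - proxy[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Example group: three rollouts of the same task with outcome rewards 1, 0, 1.
rewards = [1.0, 0.0, 1.0]
lengths = [2, 3, 2]
proxy = grouped_value_proxy(rewards, lengths)
per_rollout = [gae_advantages(r, n, proxy) for r, n in zip(rewards, lengths)]
```

With gamma = lam = 1 the recursion telescopes, so each step's advantage in this sketch reduces to the rollout's final reward minus the proxy at that step. Smaller values of lam weight the one-step TD errors more heavily.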
