Graph-Enhanced Policy Optimization in LLM Agent Training
Jiazhen Yuan, Wei Zhao, Zhengbiao Bai

TL;DR
This paper introduces GEPO, a graph-based reinforcement learning method that dynamically models environment structure to improve exploration, credit assignment, and planning in training large language model agents.
Contribution
GEPO is the first approach to integrate dynamic graph construction and graph-theoretic signals into LLM agent training, addressing structural blindness issues.
Findings
Achieved success rate improvements of +4.1%, +5.3%, and +10.9% on three benchmarks.
Demonstrated robustness and generalizability of environmental structure modeling.
Enhanced exploration and credit assignment in multi-turn interactive tasks.
Abstract
Group-based reinforcement learning (RL) has shown impressive results on complex reasoning and mathematical tasks. Yet, when applied to train multi-turn, interactive LLM agents, these methods often suffer from structural blindness: the inability to exploit the underlying connectivity of the environment. This manifests in three critical challenges: (1) inefficient, unguided exploration, (2) imprecise credit assignment due to overlooking pivotal states, and (3) myopic planning caused by static reward discounting. We address these issues with Graph-Enhanced Policy Optimization (GEPO), which dynamically constructs a state-transition graph from agent experience and employs graph-theoretic centrality to provide three synergistic learning signals: (1) structured intrinsic rewards that guide exploration toward high-impact states, (2) a graph-enhanced advantage function for topology-aware credit…
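The first of these signals can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract excerpt does not say which centrality measure or reward formula GEPO uses, so normalized degree centrality stands in here, and the function names, the toy room-navigation states, and the scaling factor `beta` are all hypothetical.

```python
from collections import defaultdict


def build_transition_graph(trajectories):
    """Collect directed edges (s, s_next) from lists of hashable states."""
    successors = defaultdict(set)
    nodes = set()
    for states in trajectories:
        nodes.update(states)
        for s, s_next in zip(states, states[1:]):
            if s != s_next:  # ignore self-loops
                successors[s].add(s_next)
    return nodes, successors


def degree_centrality(nodes, successors):
    """Normalized degree centrality, treating edges as undirected.

    Assumption: a stand-in for whichever centrality the paper uses.
    """
    neighbors = {s: set() for s in nodes}
    for s, outs in successors.items():
        for t in outs:
            neighbors[s].add(t)
            neighbors[t].add(s)
    denom = max(len(nodes) - 1, 1)
    return {s: len(nbrs) / denom for s, nbrs in neighbors.items()}


def intrinsic_rewards(states, centrality, beta=0.1):
    """One bonus per transition, proportional to the centrality of the
    state the agent arrives in (beta is a hypothetical scale)."""
    return [beta * centrality.get(s_next, 0.0) for s_next in states[1:]]


# Toy experience: every rollout passes through "hall".
trajectories = [
    ["start", "hall", "kitchen"],
    ["start", "hall", "bedroom"],
    ["start", "hall", "garden"],
]
nodes, successors = build_transition_graph(trajectories)
centrality = degree_centrality(nodes, successors)
bonuses = intrinsic_rewards(trajectories[0], centrality)
```

In this toy graph, "hall" connects to every other state, so it receives the highest centrality and the largest bonus. That is the kind of pivotal state the abstract says unstructured methods overlook.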
