Empowering Multi-Turn Tool-Integrated Agentic Reasoning with Group Turn Policy Optimization
Yifeng Ding, Hung Le, Songyang Han, Kangrui Ruan, Zhenghui Jin, Varun Kumar, Zijian Wang, Anoop Deoras

TL;DR
This paper introduces Group Turn Policy Optimization (GTPO), a reinforcement learning algorithm for multi-turn tool-integrated reasoning in large language models. It replaces coarse trajectory-level rewards with fine-grained turn-level rewards and self-supervised reward signals, and outperforms GRPO by 3.0% on math reasoning benchmarks.
Contribution
GTPO combines three components: turn-level reward assignment, return-based advantage estimation (normalized discounted returns used as advantages), and self-supervised reward shaping. Together they improve training effectiveness over existing methods such as GRPO.
Findings
GTPO outperforms GRPO by 3.0% on math reasoning benchmarks.
GTPO improves performance by 3.9% on commonsense reasoning and program synthesis tasks.
GTPO achieves these improvements with negligible computational overhead.
Abstract
Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches. Current RL methods, exemplified by Group Relative Policy Optimization (GRPO), suffer from coarse-grained, trajectory-level rewards that provide insufficient learning signals for complex multi-turn interactions, leading to training stagnation. To address this issue, we propose Group Turn Policy Optimization (GTPO), a novel RL algorithm specifically designed for training LLMs on multi-turn TIR tasks. GTPO introduces three key innovations: (1) turn-level reward assignment that provides fine-grained feedback for individual turns, (2) return-based advantage estimation where normalized discounted returns are calculated as advantages, and (3) self-supervised…
