Empowering Multi-Turn Tool-Integrated Agentic Reasoning with Group Turn Policy Optimization

Yifeng Ding; Hung Le; Songyang Han; Kangrui Ruan; Zhenghui Jin; Varun Kumar; Zijian Wang; Anoop Deoras

arXiv:2511.14846·cs.LG·April 21, 2026

TL;DR

This paper introduces Group Turn Policy Optimization (GTPO), a reinforcement learning algorithm for multi-turn tool-integrated reasoning in large language models. It replaces coarse trajectory-level rewards with fine-grained turn-level rewards and adds self-supervised reward shaping, and it outperforms GRPO on reasoning benchmarks.

Contribution

GTPO combines three components: turn-level reward assignment, return-based advantage estimation, and self-supervised reward shaping. Together these make training more effective than existing methods such as GRPO.

Findings

01. GTPO outperforms GRPO by 3.0% on math reasoning benchmarks.
02. GTPO improves performance by 3.9% on commonsense reasoning and program synthesis tasks.
03. GTPO achieves these improvements with negligible computational overhead.

Abstract

Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches. Current RL methods, exemplified by Group Relative Policy Optimization (GRPO), suffer from coarse-grained, trajectory-level rewards that provide insufficient learning signals for complex multi-turn interactions, leading to training stagnation. To address this issue, we propose Group Turn Policy Optimization (GTPO), a novel RL algorithm specifically designed for training LLMs on multi-turn TIR tasks. GTPO introduces three key innovations: (1) turn-level reward assignment that provides fine-grained feedback for individual turns, (2) return-based advantage estimation where normalized discounted returns are calculated as advantages, and (3) self-supervised…
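To make the second component concrete, here is a minimal Python sketch of return-based advantage estimation as the abstract describes it: each turn gets a discounted return, and the normalized returns serve as advantages. The abstract does not spell out the formula, so several details here are assumptions rather than the authors' method: the function names, the discount factor `gamma`, the sample rewards, and the choice to normalize with the mean and standard deviation of all turn-level returns in the group.

```python
from statistics import mean, pstdev


def discounted_returns(turn_rewards, gamma=0.9):
    """Return-to-go for each turn: G_t = r_t + gamma * G_{t+1}."""
    returns = []
    running = 0.0
    # Walk backwards so each turn accumulates the discounted future rewards.
    for reward in reversed(turn_rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def group_turn_advantages(group_turn_rewards, gamma=0.9, eps=1e-8):
    """Normalize discounted returns over every turn in a group of rollouts.

    Assumption: normalization uses the mean and population standard
    deviation of all turn-level returns in the group. The paper may
    normalize differently.
    """
    group_returns = [discounted_returns(r, gamma) for r in group_turn_rewards]
    flat = [g for trajectory in group_returns for g in trajectory]
    mu = mean(flat)
    sigma = pstdev(flat)
    return [
        [(g - mu) / (sigma + eps) for g in trajectory]
        for trajectory in group_returns
    ]


# Hypothetical group of three rollouts for one prompt, one reward per turn.
group = [
    [0.0, 0.0, 1.0],  # three turns, succeeds on the last turn
    [0.0, 0.0],       # two turns, never succeeds
    [0.0, 1.0],       # two turns, succeeds on the last turn
]
advantages = group_turn_advantages(group, gamma=0.5)
```

In this sketch every turn receives its own advantage, so turns closer to a successful outcome are credited more than early turns, whereas a trajectory-level scheme would assign one shared value to all turns of a rollout.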
