Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration

Wachiravit Modecrua; Krittanon Kaewtawee; Krittin Pachtrachai; Touchapon Kraisingkorn

arXiv:2604.02869·cs.AI·April 6, 2026

TL;DR

This paper applies multi-turn reinforcement learning (MT-GRPO combined with GTPO) with Iterative Reward Calibration to tool-calling agents. On customer service benchmarks, the trained Qwen3.5-4B and Qwen3-30B-A3B models outperform GPT-4.1 and GPT-4o despite being smaller.

Contribution

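The paper presents the first application of MT-GRPO combined with GTPO to training tool-calling agents, and introduces Iterative Reward Calibration, a methodology that addresses misaligned per-turn rewards through empirical discriminative analysis of rollout data.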

Findings

01 Improved Qwen3.5-4B accuracy from 63.8% to 66.7%.

02 Enhanced Qwen3-30B-A3B accuracy from 58.0% to 69.5%.

03 Trained models outperform GPT-4.1 and GPT-4o despite smaller size.

Abstract

Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Relative Policy Optimization) combined with GTPO (Generalized Token-level Policy Optimization) for training a tool-calling agent on realistic customer service tasks with an LLM-based user simulator. Through systematic analysis of training rollouts, we discover that naively designed dense per-turn rewards degrade performance by up to 14 percentage points due to misalignment between reward discriminativeness and advantage direction. We introduce Iterative Reward Calibration, a methodology for designing per-turn rewards using empirical discriminative analysis of rollout data, and show that our GTPO hybrid advantage formulation eliminates the…
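
To make the two ideas in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation: `group_relative_advantages` shows the standard GRPO-style normalization of rewards within a group of rollouts, and `discriminative_gap` is a hypothetical metric, introduced here purely for illustration, that compares the mean per-turn reward of successful and failed rollouts. The function names, the `eps` parameter, and the sample numbers are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each rollout's reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def discriminative_gap(turn_rewards, successes):
    """Hypothetical metric: mean per-turn reward of successful rollouts
    minus that of failed rollouts. A negative value means the dense
    reward favors rollouts that ultimately fail the task."""
    won = [r for r, ok in zip(turn_rewards, successes) if ok]
    lost = [r for r, ok in zip(turn_rewards, successes) if not ok]
    if not won or not lost:
        return 0.0  # the gap is undefined without both outcomes; treat as neutral
    return mean(won) - mean(lost)


# Illustrative numbers only (not taken from the paper).
outcome_rewards = [1.0, 0.0, 0.0, 1.0]      # sparse task-success reward per rollout
per_turn_rewards = [0.2, 0.9, 0.8, 0.1]     # a naively designed dense reward
successes = [r > 0.5 for r in outcome_rewards]

advantages = group_relative_advantages(outcome_rewards)
gap = discriminative_gap(per_turn_rewards, successes)
```

In this toy data the dense reward is higher on the failed rollouts, so the gap is negative. That is one way a per-turn reward could push advantages in the wrong direction, which is the kind of misalignment the abstract describes; the paper's actual analysis may be defined differently.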
