TL;DR
This paper trains tool-calling agents with reinforcement learning and introduces Iterative Reward Calibration for designing per-turn rewards. On customer service benchmarks, the trained models outperform larger models such as GPT-4.1 and GPT-4o.
Contribution
It presents the first application of MT-GRPO and GTPO for training tool-calling agents, addressing reward misalignment with a new calibration methodology.
Findings
Improved Qwen3.5-4B accuracy from 63.8% to 66.7% (+2.9 percentage points).
Improved Qwen3-30B-A3B accuracy from 58.0% to 69.5% (+11.5 percentage points).
The trained models outperform GPT-4.1 and GPT-4o despite their smaller size.
Abstract
Training tool-calling agents with reinforcement learning on multi-turn tasks remains challenging due to sparse outcome rewards and difficult credit assignment across conversation turns. We present the first application of MT-GRPO (Multi-Turn Group Relative Policy Optimization) combined with GTPO (Generalized Token-level Policy Optimization) for training a tool-calling agent on realistic customer service tasks with an LLM-based user simulator. Through systematic analysis of training rollouts, we discover that naively designed dense per-turn rewards degrade performance by up to 14 percentage points due to misalignment between reward discriminativeness and advantage direction. We introduce Iterative Reward Calibration, a methodology for designing per-turn rewards using empirical discriminative analysis of rollout data, and show that our GTPO hybrid advantage formulation eliminates the…
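The abstract refers to group-relative advantages and to an empirical check of how well a per-turn reward discriminates between rollouts. The sketch below is only an illustration of those two ideas, not the paper's implementation: `group_relative_advantages` is the standard GRPO-style normalization of rewards within one sampled group, and `reward_gap` is a hypothetical diagnostic (an assumption, not taken from the paper) that compares the mean of a per-turn reward signal in successful versus failed rollouts. All function names, the `eps` parameter, and the example numbers are made up for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO-style normalization: score each rollout against its own group.

    rewards: one scalar reward per rollout sampled for the same task.
    eps: small constant (assumed value) to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reward_gap(turn_rewards, succeeded):
    """Hypothetical diagnostic: mean reward in successful rollouts minus failed ones.

    A negative gap means the reward signal is, on average, higher in rollouts
    that ultimately failed, so optimizing it would push the policy the wrong way.
    """
    wins = [r for r, ok in zip(turn_rewards, succeeded) if ok]
    losses = [r for r, ok in zip(turn_rewards, succeeded) if not ok]
    if not wins or not losses:
        return 0.0  # no contrast available in this group
    return mean(wins) - mean(losses)


# Illustrative numbers only (not from the paper).
outcome = [1.0, 0.0, 0.0, 1.0]          # sparse task-level success
dense = [0.2, 0.8, 0.9, 0.1]            # a badly designed per-turn reward
advantages = group_relative_advantages(outcome)
gap = reward_gap(dense, [o == 1.0 for o in outcome])
```

In this toy example the dense reward has a negative gap, which is one way a per-turn signal could end up opposing the outcome-based advantage; under this assumption, a calibration loop would revise or drop such a signal before training on it.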
