Training LLMs for Multi-Step Tool Orchestration with Constrained Data Synthesis and Graduated Rewards

Cheng Jiayang; Xin Liu; Zhihan Zhang; Haoyang Wen; Zixuan Zhang; Qingyu Yin; Shiyang Li; Priyanka Nigam; Bing Yin; Chao Zhang; Yangqiu Song

arXiv:2603.24709 · cs.LG · April 8, 2026

TL;DR

This paper introduces a reinforcement learning framework for training large language models to perform multi-step tool orchestration. It addresses two obstacles, the lack of training environments with realistic API dependencies and the sparsity of binary rewards, and reports improvements on ComplexFuncBench along with transfer to other API ecosystems.

Contribution

The authors propose an RL approach that pairs constrained data synthesis, run against a deterministic environment backed by cached real API responses, with a graduated reward that gives credit for partially correct multi-step tool use.

Findings

01 Significant improvement in turn accuracy on ComplexFuncBench.

02 Effective transfer of orchestration skills to different API ecosystems.

03 Both reward components (atomic validity and orchestration consistency) are essential for optimal performance.

Abstract

Multi-step tool orchestration remains challenging for LLMs, as state-of-the-art models frequently fail on full sequence execution due to parameter errors. Training for these workflows faces two obstacles: the lack of environments supporting complex real-world API dependencies, and sparse binary rewards that provide no signal for partial correctness. We propose a reinforcement learning framework addressing both challenges. First, we construct a deterministic environment backed by a large-scale cache of real API responses, enabling constrained synthesis of valid multi-step traces with controllable complexity. Second, we introduce a graduated reward that decomposes correctness into atomic validity (call-level correctness at increasing granularity) and orchestration consistency (correct sequencing with dependency respect). On ComplexFuncBench, our approach substantially improves turn…
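The graduated reward described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the three-tier scoring in `atomic_validity`, the matching of calls by tool name in `orchestration_consistency`, the `deps` edge list, and the equal weighting `w_atomic=0.5` are all assumptions made for illustration, as are the tool names in the usage note.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    """One tool invocation: a tool name plus its keyword arguments."""
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def atomic_validity(pred: ToolCall, gold: ToolCall) -> float:
    """Call-level score at increasing granularity (assumed tiers):
    tool name -> argument keys -> argument values."""
    if pred.name != gold.name:
        return 0.0
    level = 1  # correct tool
    if set(pred.args) == set(gold.args):
        level = 2  # correct parameter names
        if pred.args == gold.args:
            level = 3  # correct parameter values
    return level / 3


def orchestration_consistency(
    pred: list[ToolCall], gold: list[ToolCall], deps: list[tuple[int, int]]
) -> float:
    """Fraction of gold dependency edges (producer, consumer) whose tools
    appear in the predicted trace with the producer first.
    Simplification: calls are matched by the first occurrence of their name."""
    if not deps:
        return 1.0
    first_pos: dict[str, int] = {}
    for idx, call in enumerate(pred):
        first_pos.setdefault(call.name, idx)
    respected = 0
    for producer, consumer in deps:
        p = first_pos.get(gold[producer].name)
        c = first_pos.get(gold[consumer].name)
        if p is not None and c is not None and p < c:
            respected += 1
    return respected / len(deps)


def graduated_reward(
    pred: list[ToolCall],
    gold: list[ToolCall],
    deps: list[tuple[int, int]],
    w_atomic: float = 0.5,  # assumed weighting, not taken from the paper
) -> float:
    """Weighted sum of mean atomic validity and orchestration consistency."""
    if not gold:
        return 0.0
    atomic = sum(
        atomic_validity(pred[i], g) if i < len(pred) else 0.0
        for i, g in enumerate(gold)
    ) / len(gold)
    orch = orchestration_consistency(pred, gold, deps)
    return w_atomic * atomic + (1.0 - w_atomic) * orch
```

With a hypothetical two-step gold trace (`search_flights` then `book_flight`, where the booking depends on the search), a prediction that gets the order and the first call right but passes a wrong `flight_id` earns partial credit instead of the zero a binary reward would give, while a prediction with the calls in reversed order scores zero on both components.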

Code & Models

Repositories

horizon-rl/ToolOrchestrationReward (GitHub)