TL;DR
This paper introduces a reinforcement learning framework for training large language models on multi-step tool orchestration. It targets two obstacles, the lack of training environments with realistic API dependencies and sparse binary rewards, and reports improved turn accuracy on ComplexFuncBench along with transfer to different API ecosystems.
Contribution
The authors propose an RL approach that pairs constrained synthesis of valid multi-step traces, generated in a deterministic cache-backed environment, with a graduated reward that gives LLMs a training signal for partially correct tool orchestration.
Findings
Significant improvement in turn accuracy on ComplexFuncBench.
Effective transfer of orchestration skills to different API ecosystems.
Both reward components (atomic validity and orchestration consistency) are needed for the best performance.
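To make the two reward components concrete, here is a minimal sketch of how a graduated reward might combine them. It is an illustration under stated assumptions, not the authors' implementation: the `Call` type, the three granularity levels (function name, argument keys, argument values), the dependency-pair format, and the equal weights are all hypothetical choices.

```python
from dataclasses import dataclass, field


@dataclass
class Call:
    """One tool call: a function name plus keyword arguments (hypothetical type)."""
    name: str
    args: dict = field(default_factory=dict)


def atomic_validity(pred: Call, gold: Call) -> float:
    """Partial credit at increasing granularity: name, then argument keys, then values."""
    if pred.name != gold.name:
        return 0.0
    levels = 1  # correct function name
    if set(pred.args) == set(gold.args):
        levels += 1  # correct set of argument names
        if pred.args == gold.args:
            levels += 1  # correct argument values
    return levels / 3


def orchestration_consistency(pred: list, deps: list) -> float:
    """Fraction of (producer, consumer) pairs whose producer is called first."""
    if not deps:
        return 1.0
    first_pos = {}
    for i, call in enumerate(pred):
        first_pos.setdefault(call.name, i)  # keep the first occurrence only
    respected = sum(
        1
        for producer, consumer in deps
        if producer in first_pos
        and consumer in first_pos
        and first_pos[producer] < first_pos[consumer]
    )
    return respected / len(deps)


def graduated_reward(pred, gold, deps, w_atomic=0.5, w_orch=0.5) -> float:
    """Weighted sum of the two components; the weights are assumed, not from the paper."""
    if not gold:
        return 0.0
    # Compare calls position by position; missing or extra calls earn no credit.
    atomic = sum(atomic_validity(p, g) for p, g in zip(pred, gold))
    atomic /= max(len(pred), len(gold))
    return w_atomic * atomic + w_orch * orchestration_consistency(pred, deps)


gold = [Call("search_hotels", {"city": "Paris"}), Call("book_hotel", {"hotel_id": "h1"})]
deps = [("search_hotels", "book_hotel")]
near_miss = [Call("search_hotels", {"city": "Paris"}), Call("book_hotel", {"hotel_id": "h9"})]
print(graduated_reward(near_miss, gold, deps))
```

In this sketch a trace with one wrong argument value still earns most of the reward, whereas a binary success signal would score it zero. That is the kind of partial-correctness signal the graduated reward is meant to supply.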
Abstract
Multi-step tool orchestration remains challenging for LLMs, as state-of-the-art models frequently fail on full sequence execution due to parameter errors. Training for these workflows faces two obstacles: the lack of environments supporting complex real-world API dependencies, and sparse binary rewards that provide no signal for partial correctness. We propose a reinforcement learning framework addressing both challenges. First, we construct a deterministic environment backed by a large-scale cache of real API responses, enabling constrained synthesis of valid multi-step traces with controllable complexity. Second, we introduce a graduated reward that decomposes correctness into atomic validity (call-level correctness at increasing granularity) and orchestration consistency (correct sequencing with dependency respect). On ComplexFuncBench, our approach substantially improves turn…
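The deterministic environment described in the abstract can be pictured as a lookup table over recorded API responses. The sketch below is only an illustration of that idea: the class name, the cache-key format, and the cache-miss behavior are assumptions, not details taken from the paper.

```python
import json


class CachedAPIEnvironment:
    """Replays recorded API responses so identical calls always return identical results."""

    def __init__(self, cache: dict):
        self._cache = dict(cache)

    @staticmethod
    def _key(name: str, args: dict) -> str:
        # Sort keys so that argument order does not change the cache key.
        return name + "|" + json.dumps(args, sort_keys=True)

    @classmethod
    def from_records(cls, records):
        """Build an environment from (name, args, response) triples."""
        return cls({cls._key(name, args): resp for name, args, resp in records})

    def call(self, name: str, args: dict) -> dict:
        key = self._key(name, args)
        if key not in self._cache:
            # Assumed behavior: report a miss rather than contact a live API.
            return {"error": "cache_miss", "call": name}
        return self._cache[key]


env = CachedAPIEnvironment.from_records([
    ("search_hotels", {"city": "Paris", "nights": 2}, {"hotels": [{"hotel_id": "h1"}]}),
    ("book_hotel", {"hotel_id": "h1"}, {"status": "confirmed"}),
])

# A two-step trace in which the second call depends on the first call's output.
found = env.call("search_hotels", {"nights": 2, "city": "Paris"})
booked = env.call("book_hotel", {"hotel_id": found["hotels"][0]["hotel_id"]})
print(booked)
```

Because every response comes from the cache, a rollout can be replayed exactly, and a synthesized trace can be restricted to calls whose arguments are known to resolve.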
