Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning
Siyuan Xu, Shiyang Li, Xin Liu, Tianyi Liu, Yixiao Li, Zhan Shi, Zixuan Zhang, Zilong Wang, Qingyu Yin, Jianshu Chen, Tuo Zhao, Bing Yin

TL;DR
COVERT is a two-stage data synthesis pipeline that builds reliable, complex tool-use environments for reinforcement learning of agentic models, with the aim of improving their tool-use accuracy and robustness.
Contribution
The paper introduces COVERT, a novel method for generating verifiable, complex tool-use data environments that facilitate RL training with reward-checkable online rollouts.
Findings
COVERT-RL improves accuracy on BFCL v3 from 56.5 to 59.9.
COVERT-RL improves accuracy on ACEBench from 53.0 to 59.3.
Stacking COVERT-RL on SFT further increases accuracy to 62.1 on BFCL v3 and 61.8 on ACEBench.
Abstract
Existing synthetic tool-use corpora are primarily designed for offline supervised fine-tuning, yet reinforcement learning (RL) requires executable environments that support reward-checkable online rollouts. We propose COVERT, a two-stage pipeline that first generates reliable base tool-use trajectories through self-evolving synthesis with multi-level validation, and then applies oracle-preserving augmentations that systematically increase environmental complexity. These augmentations introduce distractor tools, indirect or ambiguous user queries, and noisy, multi-format, or erroneous tool outputs, while strictly preserving oracle tool calls and final answers as ground truth. This design enables automatic reward computation via reference matching for standard cases and lightweight judge-assisted verification for special behaviors such as error detection, supporting RL optimization of…
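The two ideas in the abstract, an augmentation that leaves the oracle untouched and a reward computed by reference matching, can be illustrated with a minimal sketch. It is not the authors' implementation: the `Task` and `ToolCall` types, the function names, the binary reward, and the exact-order call comparison are all assumptions made for illustration, and only the distractor-tool augmentation is shown.

```python
import json
import random
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical representation of one tool invocation."""
    name: str
    arguments: dict

    def key(self):
        # Canonical form so that argument order does not affect matching.
        return (self.name, json.dumps(self.arguments, sort_keys=True))


@dataclass
class Task:
    """Hypothetical environment record: query, tool list, and oracle."""
    query: str
    tools: list          # names of the tools visible to the agent
    oracle_calls: list   # ground-truth ToolCall sequence
    oracle_answer: str   # ground-truth final answer


def add_distractor_tools(task, distractors, k, seed=0):
    """Return a harder copy of `task` with up to k extra tools.

    Only the visible tool list changes; the oracle calls and the oracle
    answer are carried over unchanged, so the original ground truth
    still applies to the augmented task.
    """
    rng = random.Random(seed)
    candidates = [t for t in distractors if t not in task.tools]
    extra = rng.sample(candidates, min(k, len(candidates)))
    tools = task.tools + extra
    rng.shuffle(tools)
    return Task(task.query, tools, task.oracle_calls, task.oracle_answer)


def reference_match_reward(task, predicted_calls, predicted_answer):
    """Binary reward (an assumption): 1.0 only if calls and answer match."""
    calls_ok = ([c.key() for c in predicted_calls]
                == [c.key() for c in task.oracle_calls])
    answer_ok = (predicted_answer.strip().lower()
                 == task.oracle_answer.strip().lower())
    return 1.0 if calls_ok and answer_ok else 0.0
```

Because the augmentation never edits `oracle_calls` or `oracle_answer`, the same reward function scores rollouts on both the base task and its augmented variants. The special behaviors the abstract mentions, such as error detection, would need a judge-based check that this sketch does not attempt.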
