When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents

Xiaolin Zhou; Aojie Yuan; Zheng Luo; Zipeng Ling; Xixiao Pan; Yicheng Gao; Haiyue Zhang; Jiate Li; Shuli Jiang; Prince Zizhuang Wang; Zixuan Zhu; Jinbo Liu; Ryan A. Rossi; Hua Wei; and Xiyang Hu

arXiv:2605.11928 · cs.AI · May 13, 2026

TL;DR

This paper introduces RobustBench-TC, a benchmark for evaluating tool-use language agents under real-world noise, and proposes ToolRL-DR, a domain-randomized RL method to improve robustness against such perturbations.

Contribution

The paper contributes RobustBench-TC, a benchmark of 22 perturbation types that are each grounded in a verified GitHub issue or documented tool-calling failure, and ToolRL-DR, a domain-randomized RL recipe that makes tool-use agents more robust to those perturbations.

Findings

01 Observation perturbations minimally affect accuracy (<5%).

02 Reward and transition perturbations significantly reduce accuracy (~30-40%).

03 ToolRL-DR improves robustness, retaining 75% of clean accuracy and narrowing performance gaps.

Abstract

Tool-use language agents are evaluated on benchmarks that assume clean inputs, unambiguous tool registries, and reliable APIs. Real deployments violate all these assumptions: user typos propagate into hallucinated tool names, a misconfigured request timeout can stall an agent indefinitely, and duplicate tool names across servers can freeze an SDK. We study these failures as a sim-to-real gap in the tool-use partially observable Markov decision process (POMDP), where deployment noise enters through the observation, action space, reward-relevant metadata, or transition dynamics. We introduce RobustBench-TC, a benchmark with 22 perturbation types organized by these four POMDP components, each grounded in a verified GitHub issue or documented tool-calling failure. Across 21 models from 1.5B to 32B parameters (including the closed-source o4-mini), the robustness profile is sharply uneven:…
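To make the four-way framing concrete, here is a minimal Python sketch of how noise might be injected per POMDP component and then sampled at random per episode. It is an illustration under stated assumptions, not the paper's implementation: the `Episode` container, the four perturbation functions, and the per-component probability `p` are all hypothetical, and each function is only a stand-in for the kind of failure the abstract describes (typos, duplicate tool names, degraded metadata, a misconfigured timeout).

```python
import random
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple

# The four POMDP components named in the abstract.
COMPONENTS = ("observation", "action", "reward", "transition")


@dataclass(frozen=True)
class Episode:
    """Hypothetical container for one tool-use episode (not the paper's schema)."""
    user_query: str
    tools: List[str]
    tool_descriptions: Dict[str, str]
    timeout_s: float = 30.0


def typo_in_query(ep: Episode, rng: random.Random) -> Episode:
    """Observation noise: swap two adjacent characters in the user query."""
    q = ep.user_query
    if len(q) < 2:
        return ep
    i = rng.randrange(len(q) - 1)
    return replace(ep, user_query=q[:i] + q[i + 1] + q[i] + q[i + 2:])


def duplicate_tool_name(ep: Episode, rng: random.Random) -> Episode:
    """Action-space noise: register an existing tool name a second time."""
    if not ep.tools:
        return ep
    return replace(ep, tools=ep.tools + [rng.choice(ep.tools)])


def blank_description(ep: Episode, rng: random.Random) -> Episode:
    """Metadata noise (assumed form): blank out one tool description."""
    if not ep.tool_descriptions:
        return ep
    name = rng.choice(sorted(ep.tool_descriptions))
    return replace(ep, tool_descriptions={**ep.tool_descriptions, name: ""})


def zero_timeout(ep: Episode, rng: random.Random) -> Episode:
    """Transition noise (assumed form): misconfigure the request timeout."""
    return replace(ep, timeout_s=0.0)


# One illustrative perturbation per component; the benchmark itself has 22 types.
PERTURBATIONS: Dict[str, Callable[[Episode, random.Random], Episode]] = {
    "observation": typo_in_query,
    "action": duplicate_tool_name,
    "reward": blank_description,
    "transition": zero_timeout,
}


def randomize(ep: Episode, rng: random.Random, p: float = 0.5) -> Tuple[Episode, List[str]]:
    """Apply each component's perturbation independently with probability p.

    Returns the perturbed episode and the list of components that were perturbed.
    """
    applied: List[str] = []
    for component in COMPONENTS:
        if rng.random() < p:
            ep = PERTURBATIONS[component](ep, rng)
            applied.append(component)
    return ep, applied
```

Calling `randomize(episode, random.Random(seed))` once per training episode would expose a policy to a different mix of perturbations each time, which is the general idea behind domain randomization; how ToolRL-DR actually samples and weights its perturbations is not specified in the text above.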
