An Executable Benchmarking Suite for Tool-Using Agents
Zhiqing Zhong, Zhijing Ye, Jiamin Wang, and Xiaodong Yu

TL;DR
This paper introduces an executable benchmarking suite for tool-using agents that explicitly separates workloads, drivers, and evidence, enabling more precise evaluation and comparison of such systems.
Contribution
It presents a standardized, auditable framework connecting multiple environments and defining explicit evidence-admission protocols for benchmarking tool-using agents.
Findings
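The suite connects three environments: WebArena Verified, a SWE-Gym slice with SWE-bench-compatible verification, and MiniWoB++.
Its admission gate separates paper-facing evidence from preflight, fixture, smoke, and diagnostic rows, while preserving non-admitted artifacts for audit and onboarding.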
Different controller variants are selected under the same workload using the admission contract.
Abstract
Closed-loop tool-using agents are increasingly evaluated in executable web, code, and micro-task environments, but benchmark reports often conflate workloads, action-generating drivers, and the evidence admitted for systems-facing claims. We present an executable benchmarking suite that makes these objects explicit under a shared evidence-admission contract. The suite connects WebArena Verified, a SWE-Gym slice with SWE-bench-compatible verification, and MiniWoB++ through common workload adapters, task manifests, event schemas, replay/freeze policy, declared drivers, and reporting pipelines. In the canonical release, the gate separates paper-facing evidence from preflight, fixture, smoke, and diagnostic rows while preserving non-admitted artifacts for audit and onboarding. The admitted evidence records latency, invalid-action behavior, patch-generation cost, verifier metadata, replay…
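As a rough illustration of the gate described in the abstract, the sketch below splits result rows into admitted evidence and retained non-admitted artifacts. It is a minimal sketch under stated assumptions, not the suite's actual schema or API: the `Row` fields, the `RowKind` values, the `admit` function, and the workload and driver labels are all hypothetical names chosen for this example, and the latency values are made up.

```python
from dataclasses import dataclass
from enum import Enum


class RowKind(Enum):
    # Hypothetical row categories, named after the kinds listed in the abstract.
    EVIDENCE = "evidence"
    PREFLIGHT = "preflight"
    FIXTURE = "fixture"
    SMOKE = "smoke"
    DIAGNOSTIC = "diagnostic"


@dataclass(frozen=True)
class Row:
    # Hypothetical record: one run of a declared driver on a workload.
    workload: str
    driver: str
    kind: RowKind
    latency_ms: float


def admit(rows):
    """Split rows into (admitted, retained).

    Only EVIDENCE rows are admitted for paper-facing claims. Every other
    row is kept in `retained` rather than discarded, so it stays
    available for audit and onboarding.
    """
    admitted = [r for r in rows if r.kind is RowKind.EVIDENCE]
    retained = [r for r in rows if r.kind is not RowKind.EVIDENCE]
    return admitted, retained


# Illustrative data only; these are not results from the paper.
rows = [
    Row("miniwob", "driver-a", RowKind.EVIDENCE, 120.0),
    Row("miniwob", "driver-a", RowKind.SMOKE, 95.0),
    Row("webarena-verified", "driver-b", RowKind.PREFLIGHT, 310.0),
    Row("swe-gym", "driver-b", RowKind.EVIDENCE, 880.0),
]
admitted, retained = admit(rows)
print(len(admitted), len(retained))
```

Keeping the retained rows in a separate list, instead of filtering them out, mirrors the abstract's point that non-admitted artifacts are preserved rather than deleted.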
