An Executable Benchmarking Suite for Tool-Using Agents

Zhiqing Zhong, Zhijing Ye, Jiamin Wang, and Xiaodong Yu

arXiv:2605.11030 · cs.SE · May 13, 2026

TL;DR

This paper introduces an executable benchmarking suite for tool-using agents that explicitly separates workloads, drivers, and evidence, enabling more precise evaluation and comparison of such systems.

Contribution

It presents a standardized, auditable framework connecting multiple environments and defining explicit evidence-admission protocols for benchmarking tool-using agents.

Findings

01

The suite connects three environments: WebArena Verified, a SWE-Gym slice with SWE-bench-compatible verification, and MiniWoB++.

02

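An admission gate separates paper-facing evidence from preflight, fixture, smoke, and diagnostic rows, while preserving non-admitted artifacts for audit and onboarding.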

03

Different controller variants are selected under the same workload using the admission contract.

Abstract

Closed-loop tool-using agents are increasingly evaluated in executable web, code, and micro-task environments, but benchmark reports often conflate workloads, action-generating drivers, and the evidence admitted for systems-facing claims. We present an executable benchmarking suite that makes these objects explicit under a shared evidence-admission contract. The suite connects WebArena Verified, a SWE-Gym slice with SWE-bench-compatible verification, and MiniWoB++ through common workload adapters, task manifests, event schemas, replay/freeze policy, declared drivers, and reporting pipelines. In the canonical release, the gate separates paper-facing evidence from preflight, fixture, smoke, and diagnostic rows while preserving non-admitted artifacts for audit and onboarding. The admitted evidence records latency, invalid-action behavior, patch-generation cost, verifier metadata, replay…
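The admission gate described in the abstract can be illustrated with a minimal sketch. This is not the suite's actual API: the `Row` record, the `RowKind` labels (including the `EVIDENCE` label for paper-facing rows), the `admit` function, and the sample workload and driver names are all hypothetical, chosen only to show the idea that admitted evidence is split from other rows without discarding anything.

```python
from dataclasses import dataclass
from enum import Enum


class RowKind(Enum):
    # The non-evidence kinds are named in the abstract; the EVIDENCE label
    # itself is an assumption made for this sketch.
    EVIDENCE = "evidence"
    PREFLIGHT = "preflight"
    FIXTURE = "fixture"
    SMOKE = "smoke"
    DIAGNOSTIC = "diagnostic"


@dataclass(frozen=True)
class Row:
    # Hypothetical record shape: one result row per (workload, driver) run.
    workload: str
    driver: str
    kind: RowKind
    latency_ms: float


def admit(rows):
    """Split rows into (admitted, retained); no row is dropped."""
    admitted = [r for r in rows if r.kind is RowKind.EVIDENCE]
    retained = [r for r in rows if r.kind is not RowKind.EVIDENCE]
    return admitted, retained


# Illustrative data only; these are not results from the paper.
rows = [
    Row("miniwob++", "driver-a", RowKind.EVIDENCE, 120.0),
    Row("miniwob++", "driver-a", RowKind.SMOKE, 95.0),
    Row("webarena-verified", "driver-b", RowKind.PREFLIGHT, 300.0),
    Row("webarena-verified", "driver-b", RowKind.EVIDENCE, 410.0),
]
admitted, retained = admit(rows)
```

Keeping the retained rows in a separate list, rather than filtering them out, mirrors the abstract's point that non-admitted artifacts stay available for audit and onboarding even though they do not back paper-facing claims.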
