WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Shuangrui Ding; Xuanlang Dai; Long Xing; Shengyuan Ding; Ziyu Liu; Yang JingYi; Penghui Yang; Zhixiong Zhang; Xilin Wei; Xinyu Fang; Yubo Ma; Haodong Duan; Jing Shao; Jiaqi Wang; Dahua Lin; Kai Chen; Yuhang Zang

arXiv:2605.10912·cs.CL·May 12, 2026

TL;DR

WildClawBench introduces a comprehensive benchmark for evaluating large language and vision-language models on realistic, long-horizon, multimodal tasks using real tools in native runtime environments.

Contribution

This work provides the first native-runtime, multimodal benchmark with real tools and long-duration tasks for assessing agent capabilities in realistic settings.

Findings

01. Claude Opus 4.7 scores 62.2% overall.

02. Most models score below 60%, leaving substantial headroom.

03. Switching the agent harness can shift a model's score by up to 18 points.

Abstract

Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side…

Code & Models

Repositories

internlm/WildClawBench (GitHub)

Datasets

internlm/WildClawBench (dataset, 8.3k downloads)
