Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents

Bowen Ye; Rang Li; Qibin Yang; Yuanxin Liu; Linli Yao; Hanglong Lv; Zhihui Xie; Chenxin An; Lei Li; Lingpeng Kong; Qi Liu; Zhifang Sui; Tong Yang

arXiv:2604.06132·cs.AI·May 8, 2026

TL;DR

Claw-Eval is an end-to-end evaluation suite for autonomous agents. It addresses three limitations of existing benchmarks: trajectory-opaque grading, underspecified safety and robustness evaluation, and narrow coverage of modalities. Its results show that model rankings shift across metrics and task groups.

Contribution

Introduces Claw-Eval, an end-to-end, multimodal evaluation suite that grades full agent trajectories and scores Completion, Safety, and Robustness, rather than relying on final outcomes alone.

Findings

01 Trajectory-opaque evaluation misses safety and robustness issues.

02 Capability does not guarantee consistency across tasks.

03 Model rankings vary significantly across different metrics and task groups.

Abstract

Large language models are increasingly deployed as autonomous agents for multi-step workflows in real-world software environments. However, existing agent benchmarks are limited by trajectory-opaque grading, underspecified safety and robustness evaluation, and narrow coverage of modalities and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing these gaps with 300 human-verified tasks spanning 9 categories across three groups: general service orchestration, multimodal perception and interaction, and multi-turn professional dialogue. To enable trajectory-aware grading, each run is recorded through three independent evidence channels: execution traces, audit logs, and environment snapshots, yielding 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, with Average Score, Pass@k, and Pass^k across three…
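The abstract names Pass@k and Pass^k without defining them in this excerpt. The sketch below assumes the definitions commonly used in agent benchmarks: a task counts toward Pass@k if at least one of its k trials passes, and toward Pass^k only if all k trials pass. The function names, the task identifiers, and the trial outcomes are hypothetical and are not taken from the paper or its repository.

```python
from typing import Dict, List


def pass_at_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks where at least one of the k trials passed (assumed definition)."""
    return sum(any(outcomes) for outcomes in trials.values()) / len(trials)


def pass_hat_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks where every one of the k trials passed (assumed definition)."""
    return sum(all(outcomes) for outcomes in trials.values()) / len(trials)


# Hypothetical outcomes for 4 tasks with k = 3 trials each (illustrative data only).
trials = {
    "task_a": [True, True, True],     # passes every trial
    "task_b": [True, False, True],    # passes some trials
    "task_c": [False, False, False],  # never passes
    "task_d": [False, True, False],   # passes once
}

print(f"Pass@3 = {pass_at_k(trials):.2f}")   # 3 of 4 tasks pass at least once
print(f"Pass^3 = {pass_hat_k(trials):.2f}")  # 1 of 4 tasks passes every time
```

Under these assumed definitions, the gap between the two numbers separates capability (can the agent ever solve the task) from consistency (does it solve the task every time), which is the distinction the second finding points at.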

Code & Models

Repositories

claw-eval/claw-eval
github

Datasets

claw-eval/Claw-Eval
dataset· 4.5k dl
