ClawBench: Can AI Agents Complete Everyday Online Tasks?

Yuxuan Zhang; Yubo Wang; Yipeng Zhu; Penghui Du; Junwen Miao; Xuan Lu; Wendong Xu; Yunzhuo Hao; Songcheng Cai; Xiaochen Wang; Huaisong Zhang; Xian Wu; Yi Lu; Minyi Lei; Kai Zou; Huifeng Yin; Ping Nie; Liang Chen; Dongfu Jiang; Wenhu Chen; Kelsey R. Allen

arXiv:2604.08523 · cs.CL · April 10, 2026



TL;DR

ClawBench is an evaluation framework that tests AI agents on 153 real-world online tasks across 144 live platforms. Its results show that current models remain limited at complex web interactions.

Contribution

Introduces ClawBench, a comprehensive, real-world web task benchmark that challenges AI agents with dynamic, multi-step online activities on live websites.

Findings

01

Current AI models complete only a small fraction of the tasks; Claude Sonnet 4.6, for example, completes 33.3%.

02

ClawBench captures real-world web interactions safely without side effects.

03

The benchmark covers diverse categories and complex multi-step workflows.
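To put finding 01 in concrete terms, the short sketch below converts the reported success rate into an approximate task count. It assumes the 33.3% figure is computed over all 153 tasks, which this summary does not state explicitly; the function names are hypothetical and are not part of the ClawBench codebase.

```python
TOTAL_TASKS = 153  # task count stated in the abstract


def tasks_completed(success_rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks implied by a percentage success rate.

    Assumption: the rate is measured over all `total` tasks.
    """
    return round(success_rate_percent / 100 * total)


def success_rate(completed: int, total: int = TOTAL_TASKS) -> float:
    """Success rate in percent, rounded to one decimal place."""
    return round(100 * completed / total, 1)


if __name__ == "__main__":
    implied = tasks_completed(33.3)
    # Round-trip check: the implied count should reproduce the reported rate.
    print(implied, success_rate(implied))
```

Under that assumption, a 33.3% success rate corresponds to roughly one task in three, leaving about two thirds of the benchmark unsolved.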

Abstract

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites,…


Code & Models

Repositories

reacher-z/ClawBench (GitHub)
