ClawBench: Can AI Agents Complete Everyday Online Tasks?
Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu, Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen, Dongfu Jiang, Wenhu Chen, Kelsey R. Allen

TL;DR
ClawBench is a new evaluation framework that tests AI agents on 153 real-world online tasks across 144 live platforms, and it shows that current models remain limited at complex web interactions.
Contribution
Introduces ClawBench, a comprehensive, real-world web task benchmark that challenges AI agents with dynamic, multi-step online activities on live websites.
Findings
Current AI models complete only a small fraction of the tasks; Claude Sonnet 4.6, for example, completes 33.3%.
ClawBench captures real-world web interactions safely without side effects.
The benchmark covers diverse categories and complex multi-step workflows.
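To put the 33.3% figure in absolute terms, the short Python sketch below converts a completion rate into an approximate task count. It assumes the rate is measured over all 153 tasks, which this summary does not confirm. The helper name `tasks_completed` is hypothetical and is not part of the benchmark's code.

```python
TOTAL_TASKS = 153  # benchmark size stated in the abstract


def tasks_completed(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Approximate number of tasks implied by a completion rate in percent.

    Assumption: the rate is computed over all `total` tasks.
    """
    return round(rate_percent / 100 * total)


if __name__ == "__main__":
    # Rate reported for Claude Sonnet 4.6 in the findings above.
    print(tasks_completed(33.3))  # prints 51
```

Under that assumption, the reported rate corresponds to roughly 51 of the 153 tasks, leaving about two thirds unsolved.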
Abstract
AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites,…
Code & Models
- NAIL-Group/ClawBench · dataset · 405 dl
- molmohsen/awesome-ai-agent-papers · dataset · 39 dl
- NAIL-Group/ClawBenchV1Trace · dataset · 1.3k dl
- Duke313/ClawBench-test · dataset · 114 dl
- TIGER-Lab/ClawBench · dataset · 354 dl
- ZhangArthurHao/ClawBenchV1Trace · dataset · 265 dl
- NAIL-Group/ClawBenchV2Trace · dataset · 386 dl
- TIGER-Lab/ClawBenchV2Trace · dataset · 330 dl
