CocoaBench: Evaluating Unified Digital Agents in the Wild

CocoaBench Team: Shibo Hao; Zhining Zhang; Zhiqi Liang; Tianyang Liu; Yuheng Zha; Qiyue Gao; Jixuan Chen; Zilong Wang; Zhoujun Cheng; Haoxiang Zhang; Junli Wang; Hexi Jin; Boyuan Zheng; Kun Zhou; Yu Wang; Feng Yao; Licheng Liu; Yijiang Li; Zhifei Li; Zhengtao Han; Pracha Promthaw; Tommaso Cerruti; Xiaohan Fu; Ziqiao Ma; Jingbo Shang; Lianhui Qin; Julian McAuley; Eric P. Xing; Zhengzhong Liu; Rupesh Kumar Srivastava; Zhiting Hu

arXiv:2604.11201 · cs.CL · April 15, 2026

TL;DR

CocoaBench is a new benchmark designed to evaluate the performance of unified digital agents across diverse, long-horizon tasks that require combining vision, search, and coding capabilities.

Contribution

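The paper introduces CocoaBench, a benchmark of human-designed, long-horizon tasks that require composing vision, search, and coding, and CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Together they enable scalable, reliable evaluation of unified agents.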

Findings

1. Current agents achieve only 45.1% success on CocoaBench.
2. Significant gaps remain in reasoning, planning, and visual grounding.
3. CocoaBench enables assessment of agents' ability to handle diverse, long-horizon tasks.

Abstract

LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents…
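The task format described in the abstract, an instruction paired with an automatic evaluation function over the final output, can be illustrated with a minimal Python sketch. This is not the CocoaBench or CocoaAgent API: the names `Task`, `success_rate`, `normalized_match`, and `toy_agent`, as well as the toy tasks themselves, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Task:
    """Hypothetical task record: an instruction plus a checker on the final output."""
    instruction: str
    evaluate: Callable[[str], bool]


def normalized_match(expected: str) -> Callable[[str], bool]:
    """Build a checker that compares outputs ignoring case and surrounding whitespace."""
    def check(output: str) -> bool:
        return output.strip().lower() == expected.strip().lower()
    return check


def success_rate(tasks: Iterable[Task], agent: Callable[[str], str]) -> float:
    """Fraction of tasks whose final output passes the task's own evaluation function."""
    task_list = list(tasks)
    if not task_list:
        raise ValueError("at least one task is required")
    passed = sum(1 for t in task_list if t.evaluate(agent(t.instruction)))
    return passed / len(task_list)


# Toy tasks, invented for this sketch (not taken from the benchmark).
toy_tasks = [
    Task("What is 2 + 3?", normalized_match("5")),
    Task("Name the capital of France.", normalized_match("Paris")),
]


def toy_agent(instruction: str) -> str:
    """Stand-in agent: any callable mapping an instruction to a final output works here."""
    return " 5 " if "2 + 3" in instruction else "Lyon"


rate = success_rate(toy_tasks, toy_agent)
print(f"success rate: {rate:.1%}")
```

Because the checker sees only the final output, the same task list can score any agent regardless of which tools or scaffold it uses internally, which is the property the abstract credits for evaluation across diverse agent infrastructures.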

Code & Models

Repositories

cocoabench/cocoa-agent (GitHub)