ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

Fanqing Meng; Lingxiao Du; Zijian Wu; Guanzheng Chen; Xiangyan Liu; Jiaqi Liao; Chonghe Jiang; Zhenglin Wan; Jiawei Gu; Pengfei Zhou; Rui Huang; Ziqi Zhao; Shengyuan Ding; Ailing Yu; Bo Peng; Bowei Xia; Hao Sun; Haotian Liang; Ji Xie; Jiajun Chen; Jiajun Song; Liu Yang; Ming Xu; Qionglin Qiu; Runhao Fu; Shengfang Zhai; Shijian Wang; Tengfei Ma; Tianyi Wu; Weiyang Jin; Yan Wang; Yang Dai; Yao Lai; Youwei Shu; Yue Liu; Yunzhuo Hao; Yuwei Niu; Jinkai Huang; Jiayuan Zhuo; Zhennan Shen; Linyu Wu; Hannah Yao; Charles Chen; Cihang Xie; Yuyin Zhou; Jiaheng Zhang; Zeyu Zheng; Mengkang Hu; Michael Qizhe Shieh

arXiv:2604.23781 · cs.CV · May 6, 2026

TL;DR

ClawMark is a benchmark for evaluating persistent coworker agents on multi-turn, multi-day tasks in evolving multimodal environments. Current models make partial progress but rarely complete a full workflow.

Contribution

ClawMark pairs 100 multi-turn, multi-day tasks across 13 professional scenarios with a stateful sandboxed environment whose state evolves between turns, addressing the limitations of existing benchmarks, which are typically static and largely text-centric.

Findings

01 The strongest model achieves an overall score of 75.8 but a task success rate of only 20.0%.

02 Performance declines after environment updates, indicating that agents struggle to adapt to changed state.

03 Partial progress is common, but full workflow completion remains difficult.
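The gap between a high overall score and a low success rate can be illustrated with a minimal sketch. It assumes, hypothetically, that the overall score is the mean fraction of rule-based checks passed per task and that a task succeeds only when every check passes. The paper's actual scoring formula is not given in this summary, and the names `TaskResult`, `partial_score`, `is_success`, and `summarize`, as well as the toy data, are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Outcome of each rule-based check for one task (True = passed).
    checks: list[bool]


def partial_score(result: TaskResult) -> float:
    """Percentage of checks passed (assumption: checks are unweighted)."""
    return 100.0 * sum(result.checks) / len(result.checks)


def is_success(result: TaskResult) -> bool:
    """Strict success (assumption): every check must pass."""
    return all(result.checks)


def summarize(results: list[TaskResult]) -> tuple[float, float]:
    """Return (mean partial score, strict success rate), both in percent."""
    overall = sum(partial_score(r) for r in results) / len(results)
    success_rate = 100.0 * sum(is_success(r) for r in results) / len(results)
    return overall, success_rate


# Toy data, not taken from the paper: most tasks miss one or two checks.
results = [
    TaskResult([True, True, True, True]),
    TaskResult([True, True, True, False]),
    TaskResult([True, True, False, True]),
    TaskResult([True, False, True, True]),
    TaskResult([True, True, False, False]),
]
overall, success_rate = summarize(results)
print(overall, success_rate)
```

Under these assumptions, a single missed check anywhere in a workflow costs the task its success while barely moving the mean score, which is one way the two metrics can diverge.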

Abstract

Language-model agents are increasingly used as persistent coworkers that assist users across multiple working days. During such workflows, the surrounding environment may change independently of the agent: new emails arrive, calendar entries shift, knowledge-base records are updated, and evidence appears across images, scanned PDFs, audio, video, and spreadsheets. Existing benchmarks do not adequately evaluate this setting because they typically run within a single static episode and remain largely text-centric. We introduce ClawMark, a benchmark for coworker agents built around multi-turn, multi-day tasks, a stateful sandboxed service environment whose state evolves between turns, and rule-based verification. The current release contains 100 tasks across 13 professional scenarios, executed against five stateful sandboxed services (filesystem, email, calendar, knowledge base,…
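The setting described in the abstract, where state changes between turns independently of the agent and the result is checked by a rule, can be sketched as a toy loop. This is not ClawMark's harness: `run_episode`, `move_review`, `StaleAgent`, `fresh_agent`, and `verify` are hypothetical names, and the two-service state is a simplification of the services the abstract lists.

```python
from typing import Callable

State = dict[str, list[str]]
Update = Callable[[State], None]


def run_episode(agent, updates: dict[int, list[Update]], turns: int):
    """Run `turns` turns; updates keyed by turn index fire before that turn."""
    state: State = {"calendar": ["Mon 10:00 review"], "email": []}
    answer = ""
    for turn in range(turns):
        for apply_update in updates.get(turn, []):
            apply_update(state)  # the environment changes on its own
        answer = agent(turn, state)
    return answer, state


def move_review(state: State) -> None:
    # Hypothetical between-turn event: the meeting is rescheduled.
    state["calendar"] = ["Tue 14:00 review"]
    state["email"].append("Review moved to Tue 14:00")


class StaleAgent:
    """Reads the calendar once and reuses the cached value afterwards."""

    def __init__(self) -> None:
        self.cached = None

    def __call__(self, turn: int, state: State) -> str:
        if self.cached is None:
            self.cached = state["calendar"][0]
        return self.cached


def fresh_agent(turn: int, state: State) -> str:
    """Re-reads the calendar on every turn."""
    return state["calendar"][0]


def verify(answer: str, state: State) -> bool:
    # Rule-based check: the final answer must match the current calendar.
    return answer == state["calendar"][0]


stale_answer, stale_state = run_episode(StaleAgent(), {1: [move_review]}, 2)
fresh_answer, fresh_state = run_episode(fresh_agent, {1: [move_review]}, 2)
stale_ok = verify(stale_answer, stale_state)
fresh_ok = verify(fresh_answer, fresh_state)
print(stale_ok, fresh_ok)
```

In this toy setup, the agent that trusts its first-day reading fails the check after the update, while the agent that re-reads the state passes. A single static episode would not separate the two.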

Code & Models

Repositories

evolvent-ai/ClawMark (GitHub)