ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Xiangyi Li; Kyoung Whan Choe; Yimin Liu; Xiaokun Chen; Chujun Tao; Bingran You; Wenbo Chen; Zonglin Di; Jiankai Sun; Shenghan Zheng; Jiajun Bao; Yuanli Wang; Weixiang Yan; Yiyuan Li; Han-chung Lee

arXiv:2604.05172·cs.AI·April 9, 2026

2 Repos, 1 Dataset

TL;DR

ClawsBench is a comprehensive benchmark that evaluates LLM productivity agents in realistic, multi-service workspaces, measuring their task success and safety across diverse scenarios.

Contribution

It introduces a high-fidelity simulation environment with structured tasks and analyzes the effects of different scaffolding strategies on agent performance and safety.

Findings

01 Agents achieve 39-64% task success with full scaffolding.

02 Unsafe actions occur at rates of 7-33%, with no clear correlation to success.

03 Eight unsafe behavior patterns are identified, including sandbox escalation and silent contract changes.

Abstract

Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We introduce ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It includes five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore, along with 44 structured tasks covering single-service, cross-service, and safety-critical scenarios. We decompose agent scaffolding into two independent levers (domain skills that inject API knowledge via progressive disclosure, and a meta prompt that coordinates behavior…
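The deterministic snapshot/restore mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class `MockMailService` and its methods `send`, `delete`, `snapshot`, and `restore` are hypothetical names and do not come from the ClawsBench code. The idea is only that a mock service holds all of its state in memory, so a deep copy taken before a task can be restored afterwards, which makes destructive agent actions reversible.

```python
import copy


class MockMailService:
    """Hypothetical in-memory mail service; not the ClawsBench API."""

    def __init__(self):
        self.messages = {}   # message id -> message fields
        self._next_id = 1    # counter kept in state so ids are reproducible

    def send(self, to, subject):
        """Store a message and return its deterministic id."""
        msg_id = f"msg-{self._next_id}"
        self._next_id += 1
        self.messages[msg_id] = {"to": to, "subject": subject}
        return msg_id

    def delete(self, msg_id):
        """Irreversible on a live service; reversible here via restore()."""
        del self.messages[msg_id]

    def snapshot(self):
        """Return an independent deep copy of the full service state."""
        return copy.deepcopy(
            {"messages": self.messages, "next_id": self._next_id}
        )

    def restore(self, snap):
        """Reset the service to a previously captured snapshot."""
        self.messages = copy.deepcopy(snap["messages"])
        self._next_id = snap["next_id"]


# Usage: capture state, let an agent act destructively, then roll back.
service = MockMailService()
service.send("alice@example.com", "Agenda")
before = service.snapshot()

service.delete("msg-1")                      # simulated unsafe agent action
service.send("bob@example.com", "Oops")

service.restore(before)
assert list(service.messages) == ["msg-1"]
```

Keeping the id counter inside the snapshot is a design choice in this sketch: after a restore, the next message gets the same id it would have received originally, so repeated runs of a task start from identical state.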

Code & Models

Datasets

benchflow/ClawsBench
dataset · 885 downloads
