ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
Xiangyi Li, Kyoung Whan Choe, Yimin Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Shenghan Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, Han-chung Lee

TL;DR
ClawsBench is a benchmark that evaluates LLM productivity agents in simulated, stateful, multi-service workspaces, measuring both task success and safety.
Contribution
It introduces five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with 44 structured tasks, and analyzes how different scaffolding strategies affect agent performance and safety.
Findings
Agents achieve 39-64% task success with full scaffolding.
Unsafe actions occur at rates of 7-33%, with no clear correlation to success.
Eight unsafe behavior patterns are identified, including sandbox escalation and silent contract changes.
Abstract
Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We introduce ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It includes five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore, along with 44 structured tasks covering single-service, cross-service, and safety-critical scenarios. We decompose agent scaffolding into two independent levers (domain skills that inject API knowledge via progressive disclosure, and a meta prompt that coordinates behavior…
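To make "deterministic snapshot/restore" concrete, here is a minimal Python sketch of how a stateful mock service could be reset between agent runs. It is an illustration under assumptions, not the ClawsBench implementation: the class `MockMailService` and its methods `send`, `delete`, `snapshot`, and `restore` are hypothetical names chosen for this example.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class MockMailService:
    """Hypothetical in-memory mail service (not the ClawsBench API)."""
    messages: dict = field(default_factory=dict)
    next_id: int = 1

    def send(self, to: str, subject: str) -> int:
        # Store the message under a sequential id and return that id.
        msg_id = self.next_id
        self.messages[msg_id] = {"to": to, "subject": subject}
        self.next_id += 1
        return msg_id

    def delete(self, msg_id: int) -> None:
        # A destructive action, the kind that is risky on a live service.
        del self.messages[msg_id]

    def snapshot(self) -> dict:
        # Deep-copy so later mutations cannot leak into the snapshot.
        return copy.deepcopy({"messages": self.messages, "next_id": self.next_id})

    def restore(self, snap: dict) -> None:
        # Deep-copy again so the same snapshot can be restored repeatedly.
        state = copy.deepcopy(snap)
        self.messages = state["messages"]
        self.next_id = state["next_id"]


service = MockMailService()
service.send("alice@example.com", "Agenda")
baseline = service.snapshot()

# Simulate an agent run that mutates state, including a destructive delete.
service.delete(1)
service.send("bob@example.com", "Follow-up")

# Roll back so the next run starts from the identical state.
service.restore(baseline)
```

Restoring the id counter along with the messages is what keeps repeated runs deterministic in this sketch: every run that starts from `baseline` sees the same messages and is assigned the same ids for anything it creates.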
