TL;DR
ClawSafety is a benchmark of 120 adversarial scenarios for evaluating the safety of AI agents in realistic, high-privilege work environments. It finds attack success rates of 40% to 75% across models and shows that safety depends on the whole deployment stack, not the model alone.
Contribution
This work presents a new benchmark of 120 adversarial scenarios for evaluating AI agent safety in realistic settings, and shows that the agent framework, not just the underlying model, shapes safety outcomes.
Findings
Attack success rates range from 40% to 75% across models.
Skill instructions pose the highest risk among attack vectors.
Safety depends on both the model and the deployment framework.
Abstract
Personal AI agents like OpenClaw run with elevated privileges on users' local machines, where a single successful prompt injection can leak credentials, redirect financial transactions, or destroy files. This threat goes well beyond conventional text-level jailbreaks, yet existing safety evaluations fall short: most test models in isolated chat settings, rely on synthetic environments, and do not account for how the agent framework itself shapes safety outcomes. We introduce CLAWSAFETY, a benchmark of 120 adversarial test scenarios organized along three dimensions (harm domain, attack vector, and harmful action type) and grounded in realistic, high-privilege professional workspaces spanning software engineering, finance, healthcare, law, and DevOps. Each test case embeds adversarial content in one of three channels the agent encounters during normal work: workspace skill files, emails…
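The three-dimensional organization described in the abstract, together with the per-vector attack success rates reported in the findings, can be sketched as a small data model. This is only an illustrative sketch: the `ScenarioResult` schema, the `attack_success_rate` helper, the label strings, and the sample outcomes are all hypothetical and are not taken from the benchmark's code or results. Only the two attack channels visible in the truncated abstract are used.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ScenarioResult:
    """One adversarial scenario and its outcome (hypothetical schema)."""
    harm_domain: str        # first dimension, e.g. "finance" (label is an assumption)
    attack_vector: str      # channel that carries the injected content
    action_type: str        # harmful action the injection tries to trigger
    attack_succeeded: bool  # True if the agent performed the harmful action


def attack_success_rate(results, key):
    """Group results by one dimension and return {group: success fraction}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        group = getattr(r, key)
        totals[group] += 1
        hits[group] += int(r.attack_succeeded)
    return {g: hits[g] / totals[g] for g in totals}


# Made-up outcomes for illustration only; these are NOT results from the paper.
results = [
    ScenarioResult("finance", "skill_file", "redirect_transaction", True),
    ScenarioResult("finance", "email", "redirect_transaction", False),
    ScenarioResult("devops", "skill_file", "leak_credentials", True),
    ScenarioResult("devops", "email", "destroy_files", True),
]

by_vector = attack_success_rate(results, "attack_vector")
by_domain = attack_success_rate(results, "harm_domain")
print(by_vector)
print(by_domain)
```

Grouping by a field name keeps the same helper usable for any of the three dimensions, which is one simple way to compare attack vectors, harm domains, or action types on the same set of runs.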
