TL;DR
This paper introduces OS-BLIND, a benchmark showing that current computer-use agents frequently produce harmful outcomes even when the user's instructions are entirely benign, exposing critical safety gaps.
Contribution
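The paper presents OS-BLIND, a benchmark of 300 human-crafted tasks for evaluating agent safety under unintended attack conditions. It covers 12 categories, 8 applications, and 2 threat clusters (environment-embedded threats and agent-initiated harms), and it exposes significant vulnerabilities in frontier models and in existing safety defenses.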
Findings
Most computer-use agents (CUAs) exceed a 90% attack success rate (ASR) on OS-BLIND.
Even the safety-aligned Claude 4.5 Sonnet reaches 73.0% ASR, rising to 92.7% in multi-agent systems.
Existing safety defenses offer limited protection when the user instructions themselves are benign.
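The ASR figures above can be read as the fraction of benchmark tasks on which the agent ends up producing the harmful outcome. The following is a minimal sketch of that calculation, not the authors' evaluation code: the `TaskResult` type, its field names, and the toy data are hypothetical, and the criterion OS-BLIND actually uses to judge a task as harmful is not described in this summary.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical record layout)."""
    task_id: str
    harmful_outcome: bool  # True if the agent carried out the harmful action


def attack_success_rate(results: list[TaskResult]) -> float:
    """Return the share of tasks that ended in a harmful outcome."""
    if not results:
        raise ValueError("cannot compute ASR over an empty result set")
    harmful = sum(1 for r in results if r.harmful_outcome)
    return harmful / len(results)


# Toy data for illustration only; these are not results from the paper.
toy_results = [
    TaskResult("task-1", True),
    TaskResult("task-2", True),
    TaskResult("task-3", False),
    TaskResult("task-4", True),
]

print(f"ASR: {attack_success_rate(toy_results):.1%}")  # 3 of 4 tasks -> 75.0%
```

Under this reading, a lower ASR is better: it means the agent declined or avoided the harmful outcome on more tasks.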
Abstract
Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, but overlook a subtle yet critical setting where user instructions are entirely benign and harm arises from the task context or execution outcome. We introduce OS-BLIND, a benchmark that evaluates CUAs under unintended attack conditions, comprising 300 human-crafted tasks across 12 categories, 8 applications, and 2 threat clusters: environment-embedded threats and agent-initiated harms. Our evaluation on frontier models and agentic frameworks reveals that most CUAs exceed 90% attack success rate (ASR), and even the safety-aligned Claude 4.5 Sonnet reaches 73.0% ASR. More interestingly, this vulnerability…
