The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

Xuwei Ding; Skylar Zhai; Linxin Song; Jiate Li; Taiwei Shi; Nicholas Meade; Siva Reddy; Jian Kang; Jieyu Zhao

arXiv:2604.10577·cs.CR·April 20, 2026

TL;DR

This paper introduces OS-BLIND, a benchmark showing that current computer-use agents readily carry out entirely benign user instructions whose context or outcome is harmful, exposing critical safety gaps.

Contribution

The paper presents OS-BLIND, a benchmark of 300 human-crafted tasks for evaluating agent safety under unintended attack conditions, and uses it to expose significant vulnerabilities in frontier models and the limits of existing safety defenses.

Findings

01 Most CUAs exceed a 90% attack success rate (ASR) on OS-BLIND.

02 Even the safety-aligned Claude 4.5 Sonnet reaches 73.0% ASR, rising to 92.7% in multi-agent systems.

03 Existing safety defenses offer limited protection when harm arises from benign user instructions.

Abstract

Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, but overlook a subtle yet critical setting where user instructions are entirely benign and harm arises from the task context or execution outcome. We introduce OS-BLIND, a benchmark that evaluates CUAs under unintended attack conditions, comprising 300 human-crafted tasks across 12 categories, 8 applications, and 2 threat clusters: environment-embedded threats and agent-initiated harms. Our evaluation on frontier models and agentic frameworks reveals that most CUAs exceed 90% attack success rate (ASR), and even the safety-aligned Claude 4.5 Sonnet reaches 73.0% ASR. More interestingly, this vulnerability…
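The abstract reports results as attack success rate (ASR). As a rough illustration of how such a figure could be tallied, the sketch below assumes ASR is the percentage of tasks in which the agent carries the harmful outcome through; the paper's exact scoring rule is not given in this excerpt. The `TaskResult` schema, the field names, and the toy data are hypothetical and do not come from the OS-BLIND repository.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one benchmark task (hypothetical schema, not from the paper)."""
    task_id: str
    cluster: str          # e.g. "environment-embedded" or "agent-initiated"
    harm_completed: bool  # True if the agent carried the harmful outcome through


def attack_success_rate(results):
    """Percentage of tasks in which the harmful outcome was completed."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(r.harm_completed for r in results) / len(results)


def asr_by_cluster(results):
    """ASR computed separately for each threat cluster."""
    groups = defaultdict(list)
    for r in results:
        groups[r.cluster].append(r)
    return {cluster: attack_success_rate(rs) for cluster, rs in groups.items()}


# Toy data, invented for illustration only (not benchmark results).
toy = [
    TaskResult("t1", "environment-embedded", True),
    TaskResult("t2", "environment-embedded", True),
    TaskResult("t3", "agent-initiated", True),
    TaskResult("t4", "agent-initiated", False),
]
overall = attack_success_rate(toy)   # 3 of 4 tasks
per_cluster = asr_by_cluster(toy)
```

Under the same assumed definition, and only if each of the 300 tasks is scored exactly once, a 73.0% ASR would correspond to 219 tasks.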

Code & Models

Repositories

limenlp/OS_Blind (GitHub)
