Attacking Vision-Language Computer Agents via Pop-ups

Yanzhe Zhang; Tao Yu; Diyi Yang

arXiv:2411.02391·cs.CL·May 27, 2025

TL;DR

This paper reveals that vision-language model-based agents are vulnerable to adversarial pop-ups, which significantly disrupt their task performance and are difficult to defend against.

Contribution

It introduces a novel attack method using adversarial pop-ups against vision-language agents and evaluates its effectiveness in real-world testing environments.

Findings

01 Attack success rate of 86% on average
02 Task success rate decreases by 47%
03 Basic defenses are ineffective

Abstract

Autonomous agents powered by large vision and language models (VLM) have demonstrated significant potential in completing daily computer tasks, such as browsing the web to book travel and operating desktop software, which requires agents to understand these interfaces. Despite such visual inputs becoming more integrated into agentic applications, what types of risks and attacks exist around them still remain unclear. In this work, we demonstrate that VLM agents can be easily attacked by a set of carefully designed adversarial pop-ups, which human users would typically recognize and ignore. This distraction leads agents to click these pop-ups instead of performing their tasks as usual. Integrating these pop-ups into existing agent testing environments like OSWorld and VisualWebArena leads to an attack success rate (the frequency of the agent clicking the pop-ups) of 86% on average and…
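The abstract defines attack success rate as the frequency of the agent clicking the pop-ups. A minimal sketch of how that metric and the task success rate could be computed from run logs follows. The `Episode` record, its field names, and the choice to count per episode rather than per step are assumptions made for illustration; they are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """One agent run on one task (hypothetical log record, not from the paper)."""
    clicked_popup: bool   # assumption: True if the agent clicked a pop-up at least once
    task_completed: bool  # True if the agent finished its original task


def attack_success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes in which the agent clicked an adversarial pop-up."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.clicked_popup for e in episodes) / len(episodes)


def task_success_rate(episodes: Sequence[Episode]) -> float:
    """Fraction of episodes in which the agent completed its original task."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(e.task_completed for e in episodes) / len(episodes)


# Toy data for illustration only; these are not results from the paper.
runs = [
    Episode(clicked_popup=True, task_completed=False),
    Episode(clicked_popup=True, task_completed=False),
    Episode(clicked_popup=True, task_completed=True),
    Episode(clicked_popup=False, task_completed=True),
]
asr = attack_success_rate(runs)
tsr = task_success_rate(runs)
```

Comparing `task_success_rate` on runs with and without pop-ups would give the kind of drop in task success that the findings report, under the same per-episode assumption.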

Code & Models

Repositories

SALT-NLP/PopupAttack (Official)

Videos

Attacking Vision-Language Computer Agents via Pop-ups

Taxonomy

Topics: Network Security and Intrusion Detection · Multi-Agent Systems and Negotiation · Logic, Reasoning, and Knowledge