Dissecting Adversarial Robustness of Multimodal LM Agents
Chen Henry Wu, Rishi Shah, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan

TL;DR
This paper introduces the ARE framework to evaluate the adversarial robustness of multimodal language model agents, revealing vulnerabilities in recent systems through targeted attacks and analyzing how component additions affect robustness.
Contribution
We propose the Agent Robustness Evaluation (ARE) framework for systematic robustness assessment of multimodal LM agents, including a new threat model and attack methods.
Findings
The latest agents are vulnerable to imperceptible image perturbations, with attack success rates of up to 67%.
Adding new components can decrease robustness, with attack success increasing by up to 20%.
Inference-time compute may unintentionally introduce new vulnerabilities.
Abstract
As language models (LMs) are used to build autonomous agents in real environments, ensuring their adversarial robustness becomes a critical challenge. Unlike chatbots, agents are compound systems with multiple components taking actions, which existing LM safety evaluations do not adequately address. To bridge this gap, we manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena, a real environment for web agents. To systematically examine the robustness of agents, we propose the Agent Robustness Evaluation (ARE) framework. ARE views the agent as a graph showing the flow of intermediate outputs between components and decomposes robustness as the flow of adversarial information on the graph. We find that we can successfully break the latest agents that use black-box frontier LMs, including those that perform reflection and…
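The graph view described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the component names (`captioner`, `policy_lm`, `action`), the per-run log format, and the choice to estimate an edge weight as the fraction of runs in which the output on that edge was adversarially influenced are all assumptions made for illustration, and the four runs below are made-up toy data.

```python
# Hypothetical agent graph: each edge carries one component's
# intermediate output to the next component.
EDGES = [
    ("captioner", "policy_lm"),  # image caption passed to the policy LM
    ("policy_lm", "action"),     # action the policy LM emits to the environment
]


def edge_weights(edges, runs):
    """Estimate one weight per edge from per-run logs.

    `runs` is a list of dicts mapping an edge to True when the output on
    that edge was judged adversarially influenced in that run (assumed
    log format). The weight is the fraction of runs flagged True.
    """
    weights = {}
    for edge in edges:
        flagged = sum(1 for run in runs if run.get(edge, False))
        weights[edge] = flagged / len(runs) if runs else 0.0
    return weights


# Toy logs for four attacked task runs (illustrative values only).
runs = [
    {("captioner", "policy_lm"): True, ("policy_lm", "action"): True},
    {("captioner", "policy_lm"): True, ("policy_lm", "action"): False},
    {("captioner", "policy_lm"): False, ("policy_lm", "action"): False},
    {("captioner", "policy_lm"): True, ("policy_lm", "action"): True},
]

weights = edge_weights(EDGES, runs)
for (src, dst), w in weights.items():
    print(f"{src} -> {dst}: {w:.2f}")
```

In this toy log, the weight on the last edge plays the role of the end-to-end attack success rate, while the weight on the earlier edge shows how often the adversarial information survived the first component. Comparing such per-edge numbers across agent configurations is the kind of analysis the framework is meant to support.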
Peer Reviews
Decision: ICLR 2025 Poster
- The authors look at the entire deployed system and not only at the LLM component of the system. This is very important as the individual components of the system might amplify or reduce robustness (as shown by the paper).
- While at first it feels like a bit of an unnecessary formalization, I ended up liking the description of the system as a directed graph where the weights of the nodes represent the likelihood of success of the attack after the given component.
- VWA-Adv seems to be a good dataset t
- The main complaint I have is that there is no discussion whatsoever regarding (indirect) prompt injection attacks [1] and all the related literature. Prompt injection attacks are very relevant as they use a similar attack vector (an adversary manipulating untrusted data) which has the same aim (triggering some specific actions). The paper would benefit from a discussion of prompt injection attacks and an explanation of how "Text access" attacks differ from those.
- I find the caption of the figure
The framework models agents as graphs, with each node representing an agent component and each edge representing the flow of information between components. ARE decomposes the final attack success into edge weights that measure the adversarial influence of information propagated on the edge. This framework could allow researchers to understand the robustness/vulnerability of various components and agent configurations. The paper shows some baseline defenses based on prompting and consistency ch
The main weakness of the submitted paper is that it does not clearly motivate and explain why the proposed framework is representative of a real scenario and the usage of integrated LLMs. Although we can, of course, not expect to model the real world perfectly, it remains unclear why the proposed framework is a good approximation. **The study focused on a specific threat model** where the attacker is a legitimate user of the platform with limited capabilities. This might not encompass the full
The paper is well-written and easy to follow. Evaluations are comprehensive.
The contribution of the paper lies more in the construction of the robustness evaluation framework for agents. More discussion of the technical challenges involved would help emphasize the contribution.
Taxonomy
Topics: Adversarial Robustness in Machine Learning
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
