How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition

Mateusz Dziemian; Maxwell Lin; Xiaohan Fu; Micha Nowak; Nick Winter; Eliot Jones; Andy Zou; Lama Ahmad; Kamalika Chaudhuri; Sahana Chennabasappa; Xander Davies; Lauren Deason; Benjamin L. Edelman; Tanner Emek; Ivan Evtimov; Jim Gust; Maia Hamin; Kat He; Klaudia Krawiecka; Riccardo Patana; Neil Perry; Troy Peterson; Xiangyu Qi; Javier Rando; Zifan Wang; Zihan Wang; Spencer Whitman; Eric Winsor; Arman Zharmagambetov; Matt Fredrikson; Zico Kolter

arXiv:2603.15714 · cs.CR · March 18, 2026

TL;DR

This study evaluates the vulnerability of large language model-based agents to indirect prompt injection attacks through a large-scale public competition, revealing widespread weaknesses and universal attack strategies across multiple models and scenarios.

Contribution

The paper introduces a comprehensive red-teaming competition with extensive attack data, highlights fundamental instruction-following weaknesses, and provides open resources for robustness research.

Findings

01. All models tested were vulnerable to prompt injections.

02. Universal attack strategies transfer across multiple models and behaviors.

03. High capability models also exhibit high vulnerability.

Abstract

LLM-based agents are increasingly deployed in high-stakes settings where they process external data sources such as emails, documents, and code repositories. This creates exposure to indirect prompt injection attacks, where adversarial instructions embedded in external content manipulate agent behavior without user awareness. A critical but underexplored dimension of this threat is concealment: since users tend to observe only an agent's final response, an attack can conceal its existence by presenting no clue of compromise in the final user-facing response while successfully executing harmful actions. This leaves users unaware of the manipulation and likely to accept harmful outcomes as legitimate. We present findings from a large-scale public red-teaming competition evaluating this dual objective across three agent settings: tool calling, coding, and computer use. The competition…
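The dual objective described in the abstract (the harmful action is executed and the final response gives no sign of it) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Episode` record, the `attack_succeeds` function, the action names, and the keyword-based concealment check are hypothetical, and this excerpt does not describe how the competition actually judged submissions.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical record of one agent run over injected external content."""
    tool_calls: list[str]   # names of the actions the agent executed
    final_response: str     # the only text the user actually sees


def attack_succeeds(episode: Episode, harmful_action: str,
                    telltale_phrases: list[str]) -> bool:
    """Return True only if both halves of the dual objective hold.

    Assumption: concealment is approximated by a simple keyword check;
    a real evaluation would likely need a far more robust judge.
    """
    # Half 1: the injected instruction led to the harmful action.
    executed = harmful_action in episode.tool_calls
    # Half 2: the user-facing response contains no clue of compromise.
    response = episode.final_response.lower()
    concealed = not any(p.lower() in response for p in telltale_phrases)
    return executed and concealed


# Illustrative usage with made-up action names.
clues = ["send_funds", "transfer"]
stealthy = Episode(["read_email", "send_funds"], "Here is your inbox summary.")
revealed = Episode(["read_email", "send_funds"],
                   "I also made a transfer as the email asked.")
print(attack_succeeds(stealthy, "send_funds", clues))
print(attack_succeeds(revealed, "send_funds", clues))
```

The point of the sketch is that an attack which executes the harmful action but is disclosed in the final response does not meet the dual objective, nor does a concealed response when no harmful action took place.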

Datasets

sureheremarv/ipi_arena_attacks
dataset · 82 downloads

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Web Application Security Vulnerabilities · Security and Verification in Computing