Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition

Andy Zou; Maxwell Lin; Eliot Jones; Micha Nowak; Mateusz Dziemian; Nick Winter; Alexander Grattan; Valent Nathanael; Ayla Croft; Xander Davies; Jai Patel; Robert Kirk; Nate Burnikell; Yarin Gal; Dan Hendrycks; J. Zico Kolter; Matt Fredrikson

arXiv:2507.20526·cs.AI·July 29, 2025

TL;DR

This paper presents a large-scale red-teaming competition revealing significant security vulnerabilities in AI agents, introduces the ART benchmark for evaluating robustness, and emphasizes the need for improved defenses against adversarial attacks.

Contribution

The paper runs the largest public red-teaming competition for AI agents to date, builds the ART benchmark from its results, and analyzes vulnerabilities across state-of-the-art models.

Findings

01

Nearly all agents exhibit policy violations for most behaviors within 10-100 queries.

02

Attacks transfer readily across different models and tasks.

03

Robustness shows only limited correlation with model size or compute.
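
To make the first finding concrete, here is a minimal sketch of how a "violation within k queries" rate could be computed from per-behavior attack logs. The function names, the log format, and the toy data are all assumptions made for illustration; they are not the paper's evaluation code or real competition data.

```python
from typing import Dict, List


def violated_within(outcomes: List[bool], k: int) -> bool:
    """Return True if any of the first k attack attempts elicited a violation."""
    return any(outcomes[:k])


def fraction_violated(logs: Dict[str, List[bool]], k: int) -> float:
    """Share of behaviors with at least one violation in the first k queries."""
    if not logs:
        return 0.0
    hits = sum(violated_within(outcomes, k) for outcomes in logs.values())
    return hits / len(logs)


# Hypothetical toy logs: one ordered list of attack outcomes per target behavior.
# True means that attempt elicited a policy violation.
toy_logs = {
    "leak_user_data": [False] * 7 + [True],         # first violation on query 8
    "unauthorized_refund": [False] * 40 + [True],   # first violation on query 41
    "skip_compliance_check": [False] * 120,         # no violation observed
}

for budget in (10, 100):
    rate = fraction_violated(toy_logs, budget)
    print(f"budget={budget}: {rate:.2f} of behaviors violated")
```

With these toy logs, one of three behaviors is violated within a budget of 10 queries and two of three within 100, which is the shape of claim the finding makes at much larger scale.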

Abstract

Recent advances have enabled LLM-powered AI agents to autonomously execute complex tasks by combining language model reasoning with tools, memory, and web access. But can these systems be trusted to follow deployment policies in realistic environments, especially under attack? To investigate, we ran the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios. Participants submitted 1.8 million prompt-injection attacks, with over 60,000 successfully eliciting policy violations such as unauthorized data access, illicit financial actions, and regulatory noncompliance. We use these results to build the Agent Red Teaming (ART) benchmark - a curated set of high-impact attacks - and evaluate it across 19 state-of-the-art models. Nearly all agents exhibit policy violations for most behaviors within 10-100 queries, with high attack…
