Optimizing AI Agent Attacks With Synthetic Data

Chloe Loughridge; Paul Colognese; Avery Griffin; Tyler Tracy; Jon Kutasov; Joe Benton

arXiv:2511.02823·cs.AI·November 5, 2025

Open Access

TL;DR

This paper presents a method for optimizing the attack policies used in AI control evaluations of agents in complex environments. It decomposes attack capability into constituent skills, builds a probabilistic model of attack dynamics to tune attack hyperparameters in simulation, and shows that the tuned attacks transfer to SHADE-Arena, reducing the safety score from 0.87 to 0.41.

Contribution

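The paper optimizes attack policies by decomposing attack capability into five skills (suspicion modeling, attack selection, plan synthesis, execution, and subtlety) and tuning attack hyperparameters in a probabilistic simulation, which makes strong attacks attainable in complex environments where little data is available.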

Findings

01. Optimized attacks reduce the safety score from 0.87 to 0.41.

02. Decomposing attack capability into separate skills improves policy optimization.

03. Probabilistic modeling lets attack hyperparameters tuned in simulation transfer to SHADE-Arena.

Abstract

As AI deployments become more complex and high-stakes, it becomes increasingly important to be able to estimate their risk. AI control is one framework for doing so. However, good control evaluations require eliciting strong attack policies. This can be challenging in complex agentic environments where compute constraints leave us data-poor. In this work, we show how to optimize attack policies in SHADE-Arena, a dataset of diverse realistic control environments. We do this by decomposing attack capability into five constituent skills -- suspicion modeling, attack selection, plan synthesis, execution and subtlety -- and optimizing each component individually. To get around the constraint of limited data, we develop a probabilistic model of attack dynamics, optimize our attack hyperparameters using this simulation, and then show that the results transfer to SHADE-Arena. This results in a…
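
To make the idea of tuning an attack hyperparameter in simulation concrete, here is a minimal toy sketch. It is not the paper's model: the function names, the uniform and Gaussian distributions, and every parameter value below are assumptions chosen for illustration. The attacker attempts a side task only when its own predicted suspicion is below `attack_threshold` (attack selection), the monitor sees that prediction plus noise (imperfect suspicion modeling), and an unflagged attempt succeeds with probability `p_execute` (execution). Safety is taken here to be the fraction of simulated deployments that end without an uncaught successful attack.

```python
import random


def simulate_safety(attack_threshold, audit_threshold=0.7, noise_sd=0.15,
                    p_execute=0.8, tasks_per_deployment=20,
                    n_deployments=2000, seed=0):
    """Toy Monte Carlo estimate of a safety score (all parameters hypothetical).

    A deployment is "safe" if the attacker is caught first, or if it never
    completes an uncaught attack within the deployment.
    """
    rng = random.Random(seed)
    safe = 0
    for _ in range(n_deployments):
        outcome_safe = True
        for _ in range(tasks_per_deployment):
            # Attacker's own estimate of how suspicious an attempt would look.
            predicted = rng.random()
            if predicted >= attack_threshold:
                continue  # attack selection: skip this opportunity
            # Monitor score differs from the attacker's estimate by noise.
            monitor_score = predicted + rng.gauss(0.0, noise_sd)
            if monitor_score >= audit_threshold:
                break  # caught: the deployment ends safely
            if rng.random() < p_execute:
                outcome_safe = False  # uncaught successful attack
                break
        safe += outcome_safe
    return safe / n_deployments


def best_threshold(grid):
    """Return the threshold in `grid` with the lowest simulated safety."""
    return min(grid, key=simulate_safety)


if __name__ == "__main__":
    grid = [i / 10 for i in range(1, 11)]
    for t in grid:
        print(f"attack_threshold={t:.1f}  safety={simulate_safety(t):.3f}")
    print("lowest-safety threshold:", best_threshold(grid))
```

In this toy setup a very low threshold leaves the attacker with too few attempts to succeed, while a very high one gets it caught more often, so a sweep over the grid is a cheap way to pick a setting before spending compute on real environment rollouts.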

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Smart Grid Security and Resilience · Network Security and Intrusion Detection