Evaluating Control Protocols for Untrusted AI Agents

Jon Kutasov; Chloe Loughridge; Yuqi Sun; Henry Sleight; Buck Shlegeris; Tyler Tracy; Joe Benton

arXiv:2511.02997·cs.AI·November 6, 2025

Open Access

TL;DR

This paper systematically evaluates control protocols for untrusted AI agents. Deferring on critical actions offers robust safety, whereas resampling improves safety against default attacks but can be bypassed by adaptive adversaries.

Contribution

The paper assesses a range of control protocols across the diverse agentic environments of SHADE-Arena and highlights that deferring on critical actions stays robust against adaptive attacks.

Findings

01

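Against a default attack policy, resampling for incrimination and deferring on critical actions perform best, increasing safety from 50% to 96%.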

02

Deferring on critical actions remains highly robust against adaptive attacks.

03

Adaptive attack strategies can significantly reduce the safety of resampling-based protocols.

Abstract

As AI systems become more capable and widely deployed as agents, ensuring their safe operation becomes critical. AI control offers one approach to mitigating the risk from untrusted AI agents by monitoring their actions and intervening or auditing when necessary. Evaluating the safety of these protocols requires understanding both their effectiveness against current attacks and their robustness to adaptive adversaries. In this work, we systematically evaluate a range of control protocols in SHADE-Arena, a dataset of diverse agentic environments. First, we evaluate blue team protocols, including deferral to trusted models, resampling, and deferring on critical actions, against a default attack policy. We find that resampling for incrimination and deferring on critical actions perform best, increasing safety from 50% to 96%. We then iterate on red team strategies against these protocols…
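To make the idea of monitoring actions and intervening or auditing more concrete, here is a minimal Python sketch of one possible "defer on critical actions" step. It is an illustration only, not the paper's implementation: the names `Decision`, `defer_on_critical`, `toy_is_critical`, and `toy_suspicion`, the audit threshold, and the rule that every critical action is handed to the trusted model are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    action: str   # the action that will actually be executed or reviewed
    source: str   # "untrusted", "trusted", or "audit"


def defer_on_critical(
    proposed: str,
    is_critical: Callable[[str], bool],
    suspicion: Callable[[str], float],
    trusted_action: Callable[[], str],
    audit_threshold: float = 0.9,  # hypothetical value, not from the paper
) -> Decision:
    """Decide what to do with one action proposed by the untrusted agent."""
    # Assumption: a monitor assigns each proposed action a score in [0, 1].
    if suspicion(proposed) >= audit_threshold:
        # Highly suspicious actions are flagged for auditing, not executed.
        return Decision(proposed, "audit")
    if is_critical(proposed):
        # Assumption: critical actions are always replaced by the trusted
        # model's action, regardless of the monitor score.
        return Decision(trusted_action(), "trusted")
    # Everything else runs as the untrusted agent proposed it.
    return Decision(proposed, "untrusted")


# Toy stand-ins for the monitor and the criticality check (hypothetical).
def toy_is_critical(action: str) -> bool:
    return action.startswith("send_money")


def toy_suspicion(action: str) -> float:
    return 0.95 if "attacker" in action else 0.1


if __name__ == "__main__":
    for proposed in ["read_inbox()", "send_money(to='alice')", "email(to='attacker')"]:
        decision = defer_on_critical(
            proposed, toy_is_critical, toy_suspicion, lambda: "ask_user_to_confirm()"
        )
        print(proposed, "->", decision.source, decision.action)
```

The design choice this sketch highlights is that deferral is keyed on the type of action rather than only on the monitor score, which is one plausible reading of why such a protocol could stay robust when an adversary adapts to the monitor.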

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Advanced Malware Detection Techniques