Bypassing AI Control Protocols via Agent-as-a-Proxy Attacks
Jafar Isbarov, Murat Kantarcioglu

TL;DR
Monitoring defenses for AI agents can be bypassed by Agent-as-a-Proxy attacks, in which an indirect prompt injection uses the agent itself to carry the attack to the monitor. The vulnerability persists even when the monitor is a frontier-scale model.
Contribution
The paper introduces the Agent-as-a-Proxy attack, a prompt injection that treats the agent as a delivery mechanism and bypasses agent and monitor simultaneously. It shows that the attack succeeds against large-scale monitors such as Qwen2.5-72B.
Findings
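Agent-as-a-Proxy attacks bypass monitors that jointly evaluate an agent's Chain-of-Thought and tool-use actions
Even frontier-scale monitors such as Qwen2.5-72B are vulnerable
High attack success rate on the AgentDojo benchmark against monitors including AlignmentCheck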
Abstract
As AI agents automate critical workloads, they remain vulnerable to indirect prompt injection (IPI) attacks. Current defenses rely on monitoring protocols that jointly evaluate an agent's Chain-of-Thought (CoT) and tool-use actions to ensure alignment with user intent. We demonstrate that these monitoring-based defenses can be bypassed via a novel Agent-as-a-Proxy attack, where prompt injection attacks treat the agent as a delivery mechanism, bypassing both agent and monitor simultaneously. While prior work on scalable oversight has focused on whether small monitors can supervise large agents, we show that even frontier-scale monitors are vulnerable. Large-scale monitoring models like Qwen2.5-72B can be bypassed by agents with similar capabilities, such as GPT-4o mini and Llama-3.1-70B. On the AgentDojo benchmark, we achieve a high attack success rate against AlignmentCheck and…
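A minimal sketch of why the agent can act as a delivery mechanism, under stated assumptions: the monitor reads the agent's reasoning and tool calls, so any text the agent repeats from a tool result also lands in the monitor's input. Everything here (`Trace`, `toy_agent`, `monitor_input`, the placeholder string) is hypothetical and is not the paper's implementation; no real model, monitor, or payload is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """What a hypothetical monitor gets to see for one agent step."""
    user_task: str
    chain_of_thought: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)


def toy_agent(user_task: str, tool_output: str) -> Trace:
    """Stand-in for an LLM agent: it quotes what it read while reasoning."""
    trace = Trace(user_task=user_task)
    trace.chain_of_thought.append(f"The tool returned: {tool_output}")
    trace.tool_calls.append("send_summary(text=" + repr(tool_output) + ")")
    return trace


def monitor_input(trace: Trace) -> str:
    """Concatenate everything the monitor evaluates jointly."""
    parts = [trace.user_task, *trace.chain_of_thought, *trace.tool_calls]
    return "\n".join(parts)


# Hypothetical marker standing in for attacker-controlled text in a tool result.
PLACEHOLDER = "<ATTACKER_CONTROLLED_TEXT>"
trace = toy_agent("Summarise my inbox", f"Meeting at 3pm. {PLACEHOLDER}")

# The user never wrote the marker, yet it is part of what the monitor reads.
reaches_monitor = PLACEHOLDER in monitor_input(trace)
print(reaches_monitor)
```

The sketch only illustrates the data flow described in the abstract: the user's task does not contain the marker, but the monitor's input does, because the agent copied it into its own trace.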
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Advanced Malware Detection Techniques
