SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
Jonathan Kutasov, Yuqi Sun, Paul Colognese, Teun van der Weij, Linda Petrini, Chen Bo Calvin Zhang, John Hughes, Xiang Deng, Henry Sleight, Tyler Tracy, Buck Shlegeris, Joe Benton

TL;DR
This paper introduces SHADE-Arena, a comprehensive evaluation dataset to assess the sabotage and monitoring capabilities of frontier LLMs in complex tasks, revealing current limitations and challenges in detecting subtle harmful behaviors.
Contribution
We present SHADE-Arena, the first highly diverse agent evaluation dataset for sabotage and monitoring in LLM agents, and report empirical results on how well frontier models perform as saboteurs and as monitors.
Findings
The strongest sabotage agent achieves a 27% sabotage success rate.
Hidden scratchpads significantly aid sabotage success.
The best monitoring model achieves an AUC of 0.87 at distinguishing benign from malign transcripts.
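To make the AUC figure concrete, here is a minimal sketch of how such a number can be computed from monitor outputs. It assumes the monitor assigns each transcript a numeric suspicion score; the function name, the 0-100 scale, and the scores themselves are illustrative assumptions, not the paper's code or data. AUC is the probability that a randomly chosen malign transcript receives a higher score than a randomly chosen benign one, with ties counted as half.

```python
from itertools import product


def monitor_auc(benign_scores, malign_scores):
    """Pairwise AUC: fraction of (benign, malign) pairs where the malign
    transcript gets the higher suspicion score (ties count as 0.5)."""
    if not benign_scores or not malign_scores:
        raise ValueError("need at least one score in each group")
    wins = 0.0
    for b, m in product(benign_scores, malign_scores):
        if m > b:
            wins += 1.0
        elif m == b:
            wins += 0.5
    return wins / (len(benign_scores) * len(malign_scores))


# Hypothetical suspicion scores, invented purely for illustration.
benign = [5, 10, 20, 40]
malign = [15, 35, 60, 90]
print(monitor_auc(benign, malign))  # 13 of 16 pairs are ranked correctly: 0.8125
```

An AUC of 0.5 means the monitor ranks transcripts no better than chance, and 1.0 means every malign transcript scores above every benign one.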
Abstract
As Large Language Models (LLMs) are increasingly deployed as autonomous agents in complex and long-horizon settings, it is critical to evaluate their ability to sabotage users by pursuing hidden objectives. We study the ability of frontier LLMs to evade monitoring and achieve harmful hidden goals while completing a wide array of realistic tasks. We evaluate a broad range of frontier LLMs using SHADE (Subtle Harmful Agent Detection & Evaluation)-Arena, the first highly diverse agent evaluation dataset for sabotage and monitoring capabilities of LLM agents. SHADE-Arena consists of complex pairs of benign main tasks and harmful side objectives in complicated environments. Agents are evaluated on their ability to complete the side task without appearing suspicious to an LLM monitor. When measuring agent ability to (a) complete the main task, (b) complete the side task, and (c) avoid…
Taxonomy
Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)
