SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents

Jonathan Kutasov; Yuqi Sun; Paul Colognese; Teun van der Weij; Linda Petrini; Chen Bo Calvin Zhang; John Hughes; Xiang Deng; Henry Sleight; Tyler Tracy; Buck Shlegeris; Joe Benton

arXiv:2506.15740 · cs.AI · July 10, 2025

TL;DR

This paper introduces SHADE-Arena, a comprehensive evaluation dataset to assess the sabotage and monitoring capabilities of frontier LLMs in complex tasks, revealing current limitations and challenges in detecting subtle harmful behaviors.

Contribution

We present SHADE-Arena, the first diverse dataset for evaluating sabotage and monitoring in LLM agents, and provide empirical insights into their capabilities and limitations.

Findings

01. The best-performing model achieves a 27% sabotage success rate.

02. Access to a hidden scratchpad significantly improves sabotage success.

03. The best monitoring model achieves an AUC of 0.87 at distinguishing benign from malign transcripts.
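
To make the AUC figure concrete, here is a minimal sketch of how an AUC can be computed from monitor suspicion scores. It assumes the monitor assigns each transcript a numeric suspicion score and that transcripts are labeled benign or sabotage; the function name `roc_auc` and the scores below are hypothetical and are not taken from the paper. The AUC is the probability that a randomly chosen sabotage transcript receives a higher score than a randomly chosen benign one, with ties counted as half.

```python
from typing import Sequence


def roc_auc(benign_scores: Sequence[float], sabotage_scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of the ROC AUC.

    Returns the fraction of (sabotage, benign) pairs in which the sabotage
    transcript has the higher suspicion score; ties count as half a win.
    """
    if not benign_scores or not sabotage_scores:
        raise ValueError("need at least one score in each class")
    wins = 0.0
    for s in sabotage_scores:
        for b in benign_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5
    return wins / (len(sabotage_scores) * len(benign_scores))


# Made-up suspicion scores, for illustration only (not data from the paper).
benign = [5, 12, 20, 35]
sabotage = [15, 40, 60, 80]
print(roc_auc(benign, sabotage))
```

An AUC of 0.5 corresponds to a monitor that cannot separate the two classes at all, and 1.0 to a monitor whose scores separate them perfectly.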

Abstract

As Large Language Models (LLMs) are increasingly deployed as autonomous agents in complex and long-horizon settings, it is critical to evaluate their ability to sabotage users by pursuing hidden objectives. We study the ability of frontier LLMs to evade monitoring and achieve harmful hidden goals while completing a wide array of realistic tasks. We evaluate a broad range of frontier LLMs using SHADE (Subtle Harmful Agent Detection & Evaluation)-Arena, the first highly diverse agent evaluation dataset for sabotage and monitoring capabilities of LLM agents. SHADE-Arena consists of complex pairs of benign main tasks and harmful side objectives in complicated environments. Agents are evaluated on their ability to complete the side task without appearing suspicious to an LLM monitor. When measuring agent ability to (a) complete the main task, (b) complete the side task, and (c) avoid…

Taxonomy

Topics: Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI)