Reliable Weak-to-Strong Monitoring of LLM Agents
Neil Kale, Chen Bo Calvin Zhang, Kevin Zhu, Ankit Aich, Paula Rodriguez, Scale Red Team, Christina Q. Knight, and Zifan Wang

TL;DR
This paper systematically evaluates the robustness of monitoring systems for detecting covert misbehavior in autonomous LLM agents, revealing key factors affecting monitor reliability and proposing improved scaffolding methods.
Contribution
It introduces a monitor red teaming (MRT) workflow for evaluating LLM monitors, a new computer-use benchmark (CUA-SHADE-Arena), and a hybrid hierarchical-sequential scaffolding, and it highlights the importance of agent awareness and scaffolding design.
Findings
An agent's awareness that it is being monitored significantly reduces monitor reliability.
Hybrid scaffolding outperforms baseline methods.
Targeted human oversight improves detection accuracy.
Abstract
We stress test monitoring systems for detecting covert misbehavior in autonomous LLM agents (e.g., secretly sharing private information). To this end, we systematize a monitor red teaming (MRT) workflow that incorporates: (1) varying levels of agent and monitor situational awareness; (2) distinct adversarial strategies to evade the monitor, such as prompt injection; and (3) two datasets and environments -- SHADE-Arena for tool-calling agents and our new CUA-SHADE-Arena, which extends TheAgentCompany, for computer-use agents. We run MRT on existing LLM monitor scaffoldings, which orchestrate LLMs and parse agent trajectories, alongside a new hybrid hierarchical-sequential scaffolding proposed in this work. Our empirical results yield three key findings. First, agent awareness dominates monitor awareness: an agent's knowledge that it is being monitored substantially degrades the monitor's…
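One of the reviews below characterizes the hybrid scaffolding as a linear combination of the hierarchical and sequential approaches. The following is a minimal sketch of that idea only, not the authors' implementation. Everything named here is an assumption made for illustration: the `Judge` callable stands in for an LLM call that returns a suspicion score in [0, 1], the chunk size, the mixing weight `mu`, the simplified sub-monitors, and the toy keyword judge are all hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

# Hypothetical stand-in for an LLM monitor call: text in, suspicion in [0, 1] out.
Judge = Callable[[str], float]


def chunks(items: Sequence[T], size: int) -> List[List[T]]:
    """Split a sequence into consecutive groups of at most `size` items."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def hierarchical_score(steps: Sequence[str], judge: Judge, size: int = 2) -> float:
    """Assumed scheme: score each chunk independently, then average upward level by level."""
    if size < 2:
        raise ValueError("size must be at least 2 so each level shrinks")
    scores = [judge("\n".join(c)) for c in chunks(steps, size)]
    while len(scores) > 1:
        scores = [sum(group) / len(group) for group in chunks(scores, size)]
    return scores[0] if scores else 0.0


def sequential_score(steps: Sequence[str], judge: Judge, size: int = 2) -> float:
    """Assumed scheme: read chunks in order and carry forward the highest suspicion so far."""
    running = 0.0
    for c in chunks(steps, size):
        running = max(running, judge("\n".join(c)))
    return running


def hybrid_score(steps: Sequence[str], judge: Judge, mu: float = 0.5, size: int = 2) -> float:
    """Linear combination of the two sub-monitor scores, weighted by the assumed parameter mu."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    h = hierarchical_score(steps, judge, size)
    s = sequential_score(steps, judge, size)
    return mu * h + (1.0 - mu) * s


def keyword_judge(text: str) -> float:
    """Toy judge for demonstration only: flags a single hard-coded keyword."""
    return 1.0 if "exfiltrate" in text else 0.0


trajectory = ["read inbox", "draft reply", "exfiltrate api key", "send reply"]
benign = ["read inbox", "draft reply", "send reply"]
print(hybrid_score(trajectory, keyword_judge, mu=0.5))
print(hybrid_score(benign, keyword_judge, mu=0.5))
```

In this toy setup the averaging in the hierarchical branch dilutes a single suspicious chunk, while the sequential branch preserves it. Blending the two is one plausible reason a combination could behave differently from either branch alone, though the paper's actual aggregation may differ.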
Peer Reviews
Decision: ICLR 2026 Oral
The paper makes several contributions, though of varying novelty. The systematic MRT framework provides useful formalization of threat models and evaluation protocols that were previously ad-hoc in the literature. CUA-SHADE-Arena extends monitoring evaluation to more realistic computer-use environments with bash commands and file operations. The hybrid monitoring architecture, while conceptually simple (linear combination of hierarchical and sequential approaches), represents a practical enginee
**Fundamentally unrealistic threat model**
The core evaluation setup is problematic: agents are explicitly prompted with both tasks and told they're participating in red-team evaluation. As the authors acknowledge in Section 4.1: "we follow SHADE-Arena and frame the agent as assisting in a red-teaming setting." This means:
- Agents are cooperative rather than adversarial
- Misbehavior is prompted, not learned through training/fine-tuning
- "Unaware" agents still know they're in an evaluation se
This paper takes a more scientific approach to agentic evaluation than most of those that I've read. Many papers simply build an agentic workflow or defense system without thoroughly probing it from a scientific standpoint. This paper includes proper statistical measures of success and asks actionable scientific questions. This paper includes error bars, which is minor but still very good. Several of the results are both important and interesting. For instance, the efficacy of different mon
The research questions are stated extremely late in the paper (line 252, page 5). I think it would be better to state these questions as motivations for the research at the beginning of the paper. These were some of the more interesting questions/findings that you came up with, so it weakens the paper to leave them so late in the draft.
Some of your findings lack depth. For instance, you check whether Haiku can monitor Sonnet, but can you characterize how monitoring capability scales with pa
* The paper is well-motivated, thorough, and well written.
* The paper introduces a foundational framework for evaluating different agents, monitoring strategies, and evasion strategies.
* The experimental setup is well thought out and the conclusions are sound with respect to the experimental results.
* The threat model presented in the paper is somewhat artificial and potentially unrealistic. One of the main assumptions made in the paper is that the agent is situationally aware. It is controversial whether or not this is a credible threat. Additionally, the agent must be prompted to perform different malicious actions. There is a lot of precedent in the AI evaluation space that this is ok, but it's not necessarily reflective of the true risks and capabilities of agents and monitors.
* The e
