How does information access affect LLM monitors' ability to detect sabotage?

Rauno Arike; Raja Mehta Moreno; Rohan Subramani; Shubhorup Biswas; Francis Rhys Ward

arXiv:2601.21112·cs.AI·February 9, 2026

Open Access · 3 Reviews

TL;DR

This paper investigates how the amount of information accessible to LLM monitors influences their ability to detect sabotage, revealing a counterintuitive less-is-more effect and proposing a hierarchical extract-and-evaluate monitoring approach that improves detection performance.

Contribution

The paper identifies the less-is-more effect in LLM oversight and proposes extract-and-evaluate (EaE) monitoring, a hierarchical method that improves sabotage detection across multiple environments.
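The two-stage structure of EaE can be sketched as follows. This is a minimal illustration, assuming (from the method's name and its description as hierarchical) that a first monitor extracts relevant excerpts and a second monitor scores only those excerpts. The function names, the keyword cues, the scoring rule, and the 0.5 threshold are all hypothetical stand-ins for LLM calls, not the authors' implementation.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces: in practice both stages would be LLM calls.
Extractor = Callable[[str], List[str]]    # full trajectory -> excerpts
Evaluator = Callable[[List[str]], float]  # excerpts -> suspicion score in [0, 1]


def extract_and_evaluate(
    trajectory: str,
    extractor: Extractor,
    evaluator: Evaluator,
    threshold: float = 0.5,  # assumed decision threshold, for illustration only
) -> Dict[str, object]:
    """Run a two-stage monitor: extract excerpts, then score only the excerpts."""
    excerpts = extractor(trajectory)
    # The evaluator never sees the full trajectory, only the extracted subset.
    score = evaluator(excerpts) if excerpts else 0.0
    return {"excerpts": excerpts, "score": score, "flagged": score >= threshold}


def keyword_extractor(trajectory: str) -> List[str]:
    """Toy extractor: keep lines containing any of a few hand-picked cues."""
    cues = ("side task", "hidden", "without the user noticing")
    return [
        line.strip()
        for line in trajectory.splitlines()
        if any(cue in line.lower() for cue in cues)
    ]


def count_evaluator(excerpts: List[str]) -> float:
    """Toy evaluator: suspicion grows with the number of excerpts, capped at 1."""
    return min(1.0, 0.4 * len(excerpts))


if __name__ == "__main__":
    demo = "I will fix the bug.\nNow do the side task quietly.\nKeep it hidden from review."
    print(extract_and_evaluate(demo, keyword_extractor, count_evaluator))
```

Keeping the two stages behind separate callables means a cheaper model could serve as the extractor while a stronger one evaluates the much shorter excerpt list, which is one way the cost argument in the findings could play out.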

Findings

01

Less information access can improve monitor performance.

02

EaE monitoring enhances sabotage detection by 16.8 percentage points.

03

Filtering information makes detection more effective and cost-efficient.
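The information-access conditions discussed in the reviews (action-only, CoT-only, and CoT+action) can be made concrete by building a different view of the same trajectory for each condition. The sketch below is only illustrative: the `Step` record, its field names, and the output format are assumptions, not the paper's data format.

```python
from dataclasses import dataclass
from typing import List

ACCESS_MODES = ("action_only", "cot_only", "cot_and_action")


@dataclass
class Step:
    reasoning: str  # the agent's chain-of-thought for this step
    action: str     # the tool call or code the agent emitted


def monitor_view(steps: List[Step], access: str) -> str:
    """Render the part of a trajectory that a monitor is allowed to see."""
    if access not in ACCESS_MODES:
        raise ValueError(f"unknown access mode: {access!r}")
    show_cot = access in ("cot_only", "cot_and_action")
    show_action = access in ("action_only", "cot_and_action")
    parts: List[str] = []
    for i, step in enumerate(steps, start=1):
        if show_cot:
            parts.append(f"[step {i} reasoning] {step.reasoning}")
        if show_action:
            parts.append(f"[step {i} action] {step.action}")
    return "\n".join(parts)


if __name__ == "__main__":
    steps = [Step("plan the edit", "open('a.py')")]
    for mode in ACCESS_MODES:
        print(f"--- {mode} ---")
        print(monitor_view(steps, mode))
```

Comparing a monitor's accuracy across these views, with the underlying trajectories held fixed, is the kind of experiment that would surface a less-is-more effect.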

Abstract

Frontier language model agents can exhibit misaligned behaviors, including deception, exploiting reward hacks, and pursuing hidden objectives. To control potentially misaligned agents, we can use LLMs themselves to monitor for misbehavior. In this paper, we study how information access affects LLM monitor performance. While one might expect that monitors perform better when they have access to more of the monitored agents' reasoning and actions, we find that contemporary systems often perform better with less information, a phenomenon we call the less-is-more effect for automated oversight. We demonstrate this phenomenon, and analyze the conditions under which it occurs, in three evaluation environments where agents must conduct sabotage while evading monitors. Leveraging this effect, we introduce extract-and-evaluate (EaE) monitoring -- a new hierarchical approach where one monitor…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

A clear empirical story around information access. The paper asks a practical question: "what should your monitor actually look at?" and answers it with cross-environment evidence that naively giving monitors everything (CoT+actions) is often suboptimal, with CoT-only or extracted subsets performing better. This is useful guidance for safety teams building pipeline monitors today.

A simple, actionable protocol (EaE) with measurable gains. EaE operationalizes the less-is-more intuition via

Weaknesses

**Contribution framing**: Beyond EaE, other "contributions" (awareness, cost analysis, reasoning vs. traditional LLMs) feel like observations rather than core advances.

**CoT-only novelty claim**: The authors state that their CoT-only results are novel because prior work (e.g., Arnav et al., 2025) focused on action-only vs. CoT+action monitors. This seems fair, but the claim could be clearer: CoT monitoring itself is well-known, so the novelty lies in the systematic evaluation across environmen

Reviewer 02 · Rating 6 · Confidence 4

Strengths

* The main contribution of the paper is valuable to the AI control community: "naively applying monitors on the full agent trajectory is often insufficient to achieve optimal monitor performance."
* EaE seems to provide a Pareto improvement (performance / cost) over other monitoring strategies.
* The paper is very well-written and the presentation is overall well-done.

Weaknesses

* The authors do not evaluate against adaptive agents (i.e., agents that can optimize their behavior knowing that they will be monitored). This makes the evaluation setup unrealistic compared to actual deployment scenarios. Figure 3 shows that when the agent knows more about the monitoring environment (e.g. SHADE-Arena), monitoring performance decreases dramatically, and the proposed approaches provide much less benefit (especially when using a weaker monitor). This indicates that the proposed approaches will n

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The abstract and introduction are clearly written, and the paper's core claim about the less-is-more effect is easy to follow.
2. The proposed approach, extract-and-evaluate, is simple and well-motivated in light of the less-is-more finding.
3. Beyond the core claims about less-is-more and EaE, Section 5 gives a number of interesting empirical findings from the conducted experiments. I think these will be useful to researchers working on LLM monitoring.
4. The studied task, monitoring for sab

Weaknesses

1. "extract-and-evaluate monitoring technique that improves sabotage detection rates in the BigCodeBench-Sabotage environment by up to 12 percentage points over the next-best monitor without increasing the false positive rate": this feels like a misleadingly cherry-picked presentation of evidence in the abstract. EaE performs at a statistically indistinguishable level in many settings in Figure 3, and underperforms CoT-only in BigBench traditional attacker/GPT-4.1. Without an a priori reason to

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Explainable Artificial Intelligence (XAI) · Topic Modeling