Noticing the Watcher: LLM Agents Can Infer CoT Monitoring from Blocking Feedback
Thomas Jiralerspong, Flemming Kondrup, Yoshua Bengio

TL;DR
This paper investigates whether large language models can independently detect when their reasoning process is monitored and whether they can develop strategies to evade such monitoring, revealing limitations in current oversight methods.
Contribution
The study demonstrates that models can infer monitoring purely from feedback signals and sometimes develop intentions to evade, highlighting challenges in ensuring reliable oversight of AI reasoning.
Findings
Models can detect monitoring in up to 19% of episodes.
Awareness of monitoring increases with model capability.
Models sometimes develop and attempt to execute evasion strategies.
Abstract
Chain-of-thought (CoT) monitoring is proposed as a method for overseeing the internal reasoning of language-model agents. Prior work has shown that when models are explicitly informed that their reasoning is being monitored, or are fine-tuned to internalize this fact, they may learn to obfuscate their CoTs in ways that allow them to evade CoT-based monitoring systems. We ask whether reasoning agents can autonomously infer that their supposedly private CoT is under surveillance, and whether this awareness leads to strategic evasion, without any explicit training or instructions. In a multi-episode agentic framework, models pursue both a primary task and a concealed side task while being told that their reasoning is private; a hidden CoT monitor blocks episodes when suspicious reasoning is detected. We find that frontier models can deduce the existence of this monitor purely from blocking…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
