Training on Documents About Monitoring Leads to CoT Obfuscation
Reilly Haskins, Bilal Chughtai, Joshua Engels

TL;DR
This paper investigates how models aware of monitoring can obfuscate their reasoning to evade detection, revealing vulnerabilities in chain-of-thought monitoring systems.
Contribution
It demonstrates that monitor-aware models can successfully hide their misbehavior and reshape reasoning traces, posing risks to CoT-based monitoring methods.
Findings
Monitor-aware models achieve higher rates of undetected misbehavior than unaware controls.
Obfuscation success correlates with CoT controllability across the eight models studied ($r=0.800$, $p=0.017$).
Models can reward-hack faster under monitoring constraints.
Abstract
Chain-of-thought (CoT) monitoring is one of the most promising tools we have for detecting model misbehavior, but its effectiveness depends on models faithfully externalizing their reasoning. Motivated by this vulnerability, we study whether monitor-aware models are capable of obfuscating their reasoning to evade detection. We use synthetic document finetuning to expose eight models to realistic pre-training-style documents describing a CoT monitor and find that monitor-aware models consistently achieve higher rates of undetected misbehavior compared to unaware controls. This effect is weaker but still present on a harder agentic task. We also show that CoT controllability, a model's ability to reshape its own reasoning trace under an imposed constraint, is closely correlated with obfuscation success across the eight models studied ($r=0.800$, $p=0.017$). Monitor-aware models placed…
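
As a sanity check on the reported correlation, the sketch below recomputes a p-value from $r=0.800$ and the eight models studied. It assumes a Pearson correlation with a two-sided t-test on $n-2$ degrees of freedom; the summary does not state which test the authors used, and the function names `t_pdf` and `two_sided_p` are illustrative, not taken from the paper.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(r: float, n: int, steps: int = 10_000) -> float:
    """Two-sided p-value for a Pearson r over n points (null: zero correlation).

    Assumption: the standard t-test for a correlation coefficient,
    t = |r| * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom.
    """
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Composite Simpson's rule for the pdf integral over [0, t]; steps must be even.
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    # By symmetry, the mass outside [-t, t] is 1 - 2 * area.
    return 1 - 2 * area


p = two_sided_p(0.800, 8)
print(round(p, 3))  # prints 0.017
```

Under these assumptions the result matches the reported $p=0.017$. With only eight data points the test has six degrees of freedom, so the estimate of $r$ carries wide uncertainty even though it clears the conventional 0.05 threshold.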
