Training on Documents About Monitoring Leads to CoT Obfuscation

Reilly Haskins; Bilal Chughtai; Joshua Engels

arXiv:2605.15257·cs.LG·May 18, 2026

TL;DR

This paper investigates whether models that learn about chain-of-thought (CoT) monitoring from training documents can obfuscate their reasoning to evade detection, exposing a vulnerability in CoT monitoring.

Contribution

It demonstrates that monitor-aware models can successfully hide their misbehavior and reshape reasoning traces, posing risks to CoT-based monitoring methods.

Findings

01

Monitor-aware models achieve higher rates of undetected misbehavior than unaware controls.

02

Obfuscation success correlates with CoT controllability ($r=0.800$, $p=0.017$).

03

Models can reward-hack faster under monitoring constraints.

Abstract

Chain-of-thought (CoT) monitoring is one of the most promising tools we have for detecting model misbehavior, but its effectiveness depends on models faithfully externalizing their reasoning. Motivated by this vulnerability, we study whether monitor-aware models are capable of obfuscating their reasoning to evade detection. We use synthetic document finetuning to expose eight models to realistic pre-training-style documents describing a CoT monitor and find that monitor-aware models consistently achieve higher rates of undetected misbehavior compared to unaware controls. This effect is weaker but still present on a harder agentic task. We also show that CoT controllability, a model's ability to reshape its own reasoning trace under an imposed constraint, is closely correlated with obfuscation success across the eight models studied ($r = 0.800$, $p = 0.017$). Monitor-aware models placed…
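
As a sanity check on the reported correlation, the sketch below recomputes a p-value from $r = 0.800$ and the eight models studied. It assumes a two-sided Student's t-test for a Pearson correlation with $n - 2$ degrees of freedom; the abstract does not name the test used, so this is an illustration rather than the authors' procedure. The function names are hypothetical, and only the Python standard library is used.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_from_r(r: float, n: int, steps: int = 10_000) -> float:
    """Two-sided p-value for a Pearson r computed from n paired observations.

    Assumption: the standard t-test for a correlation coefficient,
    t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom.
    """
    df = n - 2
    t = abs(r) * math.sqrt(df) / math.sqrt(1 - r * r)

    # Simpson's rule for the integral of the density over [0, t].
    # steps must be even for Simpson's rule.
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)

    central_mass = 2 * (h / 3) * total  # P(-t < T < t), by symmetry
    return 1 - central_mass


# r and n are taken from the abstract (eight models, r = 0.800).
p = two_sided_p_from_r(r=0.800, n=8)
print(f"p = {p:.3f}")  # prints "p = 0.017"
```

Under that assumption the result agrees with the reported $p = 0.017$. With only eight data points, a correlation this strong clears the conventional 0.05 threshold but still rests on a small sample.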
