When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors

Scott Emmons; Erik Jenner; David K. Elson; Rif A. Saurous; Senthooran Rajamanoharan; Heng Chen; Irhum Shafkat; Rohin Shah

arXiv:2507.05246·cs.AI·July 8, 2025

TL;DR

This paper examines chain-of-thought (CoT) monitoring as an AI safety defense. It argues that when harmful behavior requires complex, multi-step reasoning, the model has to carry out that reasoning in its CoT, which makes the behavior monitorable; models can still learn to evade detection, but only under certain conditions.

Contribution

The paper introduces a conceptual framework that distinguishes CoT-as-rationalization from CoT-as-computation, and provides guidelines for stress-testing how robust CoT monitoring is to evasion strategies.

Findings

01. Models can obscure their intentions, but only with significant help

02. Complex, multi-step reasoning enhances monitorability

03. Active stress-testing is essential for effective CoT monitoring

Abstract

While chain-of-thought (CoT) monitoring is an appealing AI safety defense, recent work on "unfaithfulness" has cast doubt on its reliability. These findings highlight an important failure mode, particularly when CoT acts as a post-hoc rationalization in applications like auditing for bias. However, for the distinct problem of runtime monitoring to prevent severe harm, we argue the key property is not faithfulness but monitorability. To this end, we introduce a conceptual framework distinguishing CoT-as-rationalization from CoT-as-computation. We expect that certain classes of severe harm will require complex, multi-step reasoning that necessitates CoT-as-computation. Replicating the experimental setups of prior work, we increase the difficulty of the bad behavior to enforce this necessity condition; this forces the model to expose its reasoning, making it monitorable. We then present…
