CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

Benjamin Arnav; Pablo Bernabeu-P\'erez; Nathan Helm-Burger; Tim Kostolansky; Hannes Whittingham; Mary Phuong

arXiv:2505.23575·cs.AI·November 26, 2025

CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

Benjamin Arnav, Pablo Bernabeu-P\'erez, Nathan Helm-Burger, Tim Kostolansky, Hannes Whittingham, Mary Phuong

PDF

Open Access 1 Video

TL;DR

This paper evaluates Chain-of-Thought monitoring for AI safety, revealing its strengths and limitations, and proposes a hybrid approach that significantly improves detection of harmful model behaviors.

Contribution

It introduces a hybrid monitoring protocol combining reasoning and action assessments, enhancing detection accuracy over existing methods.

Findings

01

CoT monitoring outperforms action-only monitoring in certain sabotage scenarios.

02

Reasoning traces can be misleading, reducing CoT effectiveness.

03

Hybrid monitoring achieves twice the detection rate of action-only methods.

Abstract

As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that while CoT monitoring is more effective than overseeing only model outputs in scenarios where action-only monitoring fails to reliably identify sabotage, reasoning traces can contain misleading rationalizations that deceive the CoT monitors, reducing performance in obvious sabotage cases. To address this, we introduce a hybrid…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring· slideslive

Taxonomy

TopicsExplainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI