CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring
Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky, Hannes Whittingham, Mary Phuong

TL;DR
This paper evaluates Chain-of-Thought monitoring for AI safety, revealing its strengths and limitations, and proposes a hybrid approach that significantly improves detection of harmful model behaviors.
Contribution
It introduces a hybrid monitoring protocol combining reasoning and action assessments, enhancing detection accuracy over existing methods.
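As a rough illustration of what "combining reasoning and action assessments" could look like, here is a minimal sketch. It assumes each monitor emits a suspicion score in [0, 1] and that the two scores are merged by a weighted average; this summary does not specify the combination rule, so the rule, the names `MonitorScores`, `hybrid_score`, and `flag`, and the default weight and threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MonitorScores:
    """Suspicion scores in [0, 1] from two hypothetical monitors."""
    action_score: float  # from reviewing only the final output
    cot_score: float     # from reviewing the reasoning trace


def hybrid_score(scores: MonitorScores, cot_weight: float = 0.5) -> float:
    """Weighted average of the two scores (an assumed combination rule)."""
    if not 0.0 <= cot_weight <= 1.0:
        raise ValueError("cot_weight must lie in [0, 1]")
    return cot_weight * scores.cot_score + (1.0 - cot_weight) * scores.action_score


def flag(scores: MonitorScores, threshold: float = 0.5, cot_weight: float = 0.5) -> bool:
    """Flag a transcript for review when the combined score reaches the threshold."""
    return hybrid_score(scores, cot_weight) >= threshold
```

Setting `cot_weight` to 0 reduces this to action-only monitoring and setting it to 1 reduces it to CoT-only monitoring, so the same function can be used to compare the three regimes on a common threshold.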
Findings
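CoT monitoring outperforms action-only monitoring in scenarios where action-only monitoring fails to reliably identify sabotage.
Reasoning traces can contain misleading rationalizations that deceive CoT monitors, reducing their performance in obvious sabotage cases.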
Hybrid monitoring achieves twice the detection rate of action-only methods.
Abstract
As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that while CoT monitoring is more effective than overseeing only model outputs in scenarios where action-only monitoring fails to reliably identify sabotage, reasoning traces can contain misleading rationalizations that deceive the CoT monitors, reducing performance in obvious sabotage cases. To address this, we introduce a hybrid…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI
