Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Tomek Korbak; Mikita Balesni; Elizabeth Barnes; Yoshua Bengio; Joe Benton; Joseph Bloom; Mark Chen; Alan Cooney; Allan Dafoe; Anca Dragan; Scott Emmons; Owain Evans; David Farhi; Ryan Greenblatt; Dan Hendrycks; Marius Hobbhahn; Evan Hubinger; Geoffrey Irving; Erik Jenner; Daniel Kokotajlo; Victoria Krakovna; Shane Legg; David Lindner; David Luan; Aleksander Mądry; Julian Michael; Neel Nanda; Dave Orr; Jakub Pachocki; Ethan Perez; Mary Phuong; Fabien Roger; Joshua Saxe; Buck Shlegeris; Martín Soto; Eric Steinberger; Jasmine Wang; Wojciech Zaremba; Bowen Baker; Rohin Shah; Vlad Mikulik

arXiv:2507.11473 · cs.AI · December 9, 2025 · 2 cites

Open Access

TL;DR

This paper explores the potential and fragility of monitoring chains of thought in AI systems for safety purposes, emphasizing the need for further research and careful development considerations.

Contribution

The paper introduces CoT monitorability as a new safety opportunity and discusses both its promise and its vulnerabilities.

Findings

01. CoT monitoring can help detect misbehavior in AI systems.

02. CoT monitorability is fragile and can be affected by development choices.

03. Further research and careful development are recommended for effective CoT safety measures.

Abstract

AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

Taxonomy

Topics: Anomaly Detection Techniques and Applications