A Pragmatic Way to Measure Chain-of-Thought Monitorability
Scott Emmons, Roland S. Zimmermann, David K. Elson, Rohin Shah

TL;DR
This paper introduces practical metrics and an autorater prompt for assessing the monitorability of Chain-of-Thought (CoT) reasoning in AI models. It measures two components: legibility (whether a human can follow the reasoning) and coverage (whether the CoT contains all the reasoning a human would need to produce the final output).
Contribution
The paper proposes a pragmatic, prompt-based method for measuring CoT monitorability, giving developers a way to evaluate reasoning transparency in language models and track it over time.
Findings
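Several frontier models exhibit high monitorability on the challenging benchmarks tested.
The autorater prompt, which scores the legibility and coverage of existing CoTs, passes sanity checks on synthetically degraded CoTs.
The metrics and the complete autorater prompt are offered as a tool for developers to track the impact of design decisions.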
Abstract
While Chain-of-Thought (CoT) monitoring offers a unique opportunity for AI safety, this opportunity could be lost through shifts in training practices or model architecture. To help preserve monitorability, we propose a pragmatic way to measure two components of it: legibility (whether the reasoning can be followed by a human) and coverage (whether the CoT contains all the reasoning needed for a human to also produce the final output). We implement these metrics with an autorater prompt that enables any capable LLM to compute the legibility and coverage of existing CoTs. After sanity-checking our prompted autorater with synthetic CoT degradations, we apply it to several frontier models on challenging benchmarks, finding that they exhibit high monitorability. We present these metrics, including our complete autorater prompt, as a tool for developers to track how design decisions impact…
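The workflow described in the abstract (score each CoT for legibility and coverage, then check that the scores fall when the CoT is synthetically degraded) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation. The `Rating` type, the 0-to-1 score scale, the `truncate_cot` degradation, and `toy_autorater` are all hypothetical. In particular, `toy_autorater` is a crude heuristic standing in for a call to a capable LLM with the paper's autorater prompt, which is not reproduced here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence, Tuple

# (question, chain_of_thought, final_answer); a hypothetical sample layout.
Sample = Tuple[str, str, str]


@dataclass(frozen=True)
class Rating:
    legibility: float  # assumed scale: 0.0 (unreadable) to 1.0 (fully readable)
    coverage: float    # assumed scale: 0.0 (nothing needed is present) to 1.0 (all of it)


# Any callable with this shape can act as the autorater in this sketch.
Autorater = Callable[[str, str, str], Rating]


def toy_autorater(question: str, cot: str, answer: str) -> Rating:
    """Hypothetical stand-in for an LLM autorater; NOT the paper's prompt."""
    if not cot:
        return Rating(0.0, 0.0)
    # Toy legibility: share of characters that are printable or whitespace.
    readable = sum(ch.isprintable() or ch.isspace() for ch in cot)
    # Toy coverage: does the reasoning actually arrive at the final answer?
    coverage = 1.0 if answer in cot else 0.0
    return Rating(readable / len(cot), coverage)


def truncate_cot(cot: str, keep_fraction: float) -> str:
    """Synthetic degradation (an assumption): keep only the first reasoning steps."""
    steps = cot.splitlines()
    return "\n".join(steps[: int(len(steps) * keep_fraction)])


def mean_scores(rate: Autorater, samples: Sequence[Sample]) -> Rating:
    """Average legibility and coverage over a set of samples."""
    ratings = [rate(q, c, a) for q, c, a in samples]
    return Rating(
        mean(r.legibility for r in ratings),
        mean(r.coverage for r in ratings),
    )


samples = [("What is 12 * 3 + 4?", "12 * 3 = 36\n36 + 4 = 40", "40")]
degraded = [(q, truncate_cot(c, 0.5), a) for q, c, a in samples]

baseline = mean_scores(toy_autorater, samples)
after_degradation = mean_scores(toy_autorater, degraded)
print(baseline, after_degradation)
```

In this toy setup, dropping the second reasoning step leaves the text readable but removes the step that reaches the answer, so coverage should fall while legibility stays put. A real sanity check would swap `toy_autorater` for an LLM-backed rater and use whatever degradations the evaluation calls for.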
