Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos

TL;DR
This paper uses information theory to analyze the monitorability of chain-of-thought (CoT) reasoning in language models, identifies two sources of monitor error (information gap and elicitation error), and proposes training methods to improve monitor accuracy and robustness.
Contribution
It introduces an information-theoretic framework for understanding CoT monitorability and proposes two training strategies to improve monitor performance and prevent reward hacking.
Findings
Mutual information is necessary but not sufficient for monitorability.
Two sources of error—information gap and elicitation error—affect monitor performance.
Proposed training methods significantly improve monitor accuracy across environments.
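The first finding can be illustrated with a toy calculation. The sketch below is not the paper's method. It assumes a binary output O that doubles as the monitored attribute, three hypothetical CoT policies, and a hypothetical fixed monitor that only recognizes the literal word "hack". It shows that a CoT can carry full mutual information with the output while a given monitor still performs at chance.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(A; B) in bits from a list of (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) / (p(a) * p(b)) simplifies to c * n / (count_a * count_b)
        mi += (c / n) * math.log2(c * n / (count_a[a] * count_b[b]))
    return mi


def fixed_monitor(cot):
    """Hypothetical monitor: flags the output only if the CoT says 'hack'."""
    return 1 if cot == "hack" else 0


def accuracy(cots, outputs):
    """Fraction of samples where the monitor's flag matches the output."""
    hits = sum(fixed_monitor(z) == o for z, o in zip(cots, outputs))
    return hits / len(outputs)


# Assumption: O = 1 means the output has the attribute of interest.
outputs = [0, 1] * 50

# Three hypothetical CoT policies Z, all invented for illustration.
policies = {
    "transparent": ["hack" if o else "honest" for o in outputs],
    "encoded": ["alpha" if o else "beta" for o in outputs],
    "uninformative": ["thinking" for _ in outputs],
}

results = {
    name: (mutual_information(list(zip(cots, outputs))), accuracy(cots, outputs))
    for name, cots in policies.items()
}

for name, (mi, acc) in results.items():
    print(f"{name:14s} I(Z;O) = {mi:.2f} bits, monitor accuracy = {acc:.2f}")
```

In this toy setup the transparent and encoded policies both reach 1 bit of mutual information, but only the transparent one gives the fixed monitor perfect accuracy. The uninformative policy has zero mutual information, so no monitor reading it could beat chance. This matches the "necessary but not sufficient" statement.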
Abstract
Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest, such as test-hacking behavior during code generation. In this paper, we use information-theoretic analysis to show that non-zero mutual information between CoT and output is a necessary but not sufficient condition for CoT monitorability. We identify two sources of approximation error that may undermine the performance of CoT monitors in practice: information gap, which measures the extent to which the monitor can extract the information available in CoT, and elicitation error, which measures the extent to which the monitor approximates the optimal monitoring function. We further demonstrate that CoT monitorability can be systematically improved through targeted training objectives. To this end, we propose two complementary approaches: (a) an…
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
- The scope and direction of the paper are interesting and timely: understanding how to reliably monitor LLM reasoning is a critical, unsolved problem.
- The analysis leads to a practical label-free reward that appears to lead to modest improvements in monitorability (+ resistance to hacking when paired with monitoring reward + task reward).
- A precise language for discussing monitorability (and formalizing its necessary conditions) can be valuable for the community (however, the setup is arguab
- The theoretical argument is nicely done, but it’s unclear how much new insight it offers (e.g., it is intuitively obvious that a CoT must contain information relevant to the output + attribute of interest to be useful for monitoring).
- The paper proves that maximizing I(O, Z | X) is not sufficient, but this is not reflected in the proposed reward. What stops the model from gaming this reward by simply outputting the final answer in the thought?
- It’s not that obvious how well the MI reward
1) The paper’s empirical results are interesting, and Figure 2 does a good job at communicating the value of the mutual information reward in the given environment.
2) The discussion of steganography in Appendix A, and the empirical identification of examples, is interesting (although it also highlights the limitations of the paper’s theoretical justifications for optimizing MI).
1) The paper’s entire theoretical framework makes a set of assumptions which, IIUC, **rule out the value of CoT monitoring a priori**. In particular, sections 3.1-3.2 assume that:
A) Z (CoT) can only influence Y (attribute being monitored) via O (the model’s action).
B) The monitor is Bayes-optimal.
But in this setting, it is *theoretically impossible* for the CoT to provide information to the monitor beyond that contained in the model’s action. Why bother monitoring the CoT at all, rather th
- The contributed information-theoretic framework for CoT monitorability adds a novel and potentially useful and clarifying perspective to the field. Formalizing the importance of mutual information between CoT and outputs is useful for thinking about the problem. Similarly, concretely and formally identifying two distinct sources of CoT monitoring errors is useful for designing mitigations/improvements.
- The analyses have led to a novel and practical mitigation in the form of the proposed mutu
**Limited experiments.** The current set of experiments feels very limited in breadth and depth. This makes it hard for me to gauge how promising the MI reward truly is. I will outline additional experiments I would like to see below. Ideally, this paper would add a more thorough set of empirical results.
**Theory yields somewhat expected takeaways.** While the provided theory is well executed and presented, the main takeaways from this theory are perhaps not overly surprising or insightful. I
Taxonomy
Topics: Software Testing and Debugging Techniques · Software Engineering Research · Advanced Software Engineering Methodologies
