Reasoning Under Pressure: How do Training Incentives Influence Chain-of-Thought Monitorability?
Matt MacDermott, Qiyao Wei, Rada Djoneva, Francis Rhys Ward

TL;DR
This paper examines how training incentives applied to a reasoning model affect how well its chain of thought (CoT) can be monitored. It introduces a methodology for measuring monitorability and analyzes the effects of several incentives.
Contribution
It introduces a methodology that measures monitorability by whether a monitor can predict a key latent variable from the model's reasoning, and uses it to analyze how training incentives affect monitorability.
Findings
Adversarial optimization degrades monitor performance.
Commonly used incentives (length penalties and KL regularization) show no consistent effect when controlling for accuracy.
Direct optimization for monitorability does not reliably improve it.
Abstract
AI systems that output their reasoning in natural language offer an opportunity for safety: we can monitor their chain of thought (CoT) for undesirable reasoning, such as the pursuit of harmful objectives. However, the extent to which CoT faithfully reflects the underlying reasoning process, and hence the extent to which it can be usefully monitored, may be influenced by certain aspects of training. We investigate how different training incentives, applied to a reasoning model, affect its monitorability. We introduce a novel methodology for measuring monitorability according to whether a monitor can predict a key latent variable using the model's reasoning. When controlling for accuracy, we do not find evidence for consistent effects from commonly used incentives (length penalties and KL regularisation), but we find that adversarial optimisation (penalising monitor…
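To make the measurement idea in the abstract concrete, here is a minimal Python sketch of scoring monitorability as how often a monitor recovers a latent variable from the reasoning text. This is an illustration under assumptions, not the authors' implementation: the excerpt does not specify the monitor, the latent variable, or the metric. The function names, the binary latent label, the keyword-based monitor, and the use of plain accuracy are all hypothetical choices made here.

```python
from typing import Callable, Sequence


def monitorability_score(
    reasoning_traces: Sequence[str],
    latent_labels: Sequence[bool],
    monitor: Callable[[str], bool],
) -> float:
    """Fraction of traces where the monitor's prediction matches the latent label.

    Assumption: monitorability is summarised as plain accuracy. The paper
    may use a different metric.
    """
    if len(reasoning_traces) != len(latent_labels):
        raise ValueError("traces and labels must have the same length")
    if not reasoning_traces:
        raise ValueError("need at least one trace")
    correct = sum(
        monitor(trace) == label
        for trace, label in zip(reasoning_traces, latent_labels)
    )
    return correct / len(reasoning_traces)


def keyword_monitor(trace: str) -> bool:
    """Hypothetical stand-in monitor: predicts the latent variable is True
    whenever the reasoning mentions the word 'hint'."""
    return "hint" in trace.lower()


# Toy data (invented for illustration): the latent variable records
# whether the model relied on a hint when producing its answer.
traces = [
    "The hint says B, so I will answer B.",
    "Computing 12 * 7 gives 84, so the answer is 84.",
    "I will answer C.",  # latent is True, but the reasoning does not reveal it
]
labels = [True, False, True]

score = monitorability_score(traces, labels, keyword_monitor)
```

In this toy setup the third trace hides the latent variable, so the monitor misses it and the score drops below 1. A real study would replace the keyword rule with a stronger monitor and, as the abstract notes, compare models at matched task accuracy.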
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
