Aligned, Orthogonal or In-conflict: When can we safely optimize Chain-of-Thought?
Max Kaufmann, David Lindner, Roland S. Zimmermann, and Rohin Shah

TL;DR
This paper presents a framework for predicting when training a language model will affect the monitorability of its chain-of-thought reasoning, based on how the reward terms used in reinforcement learning relate to one another.
Contribution
It introduces a conceptual framework classifying reward terms as aligned, orthogonal, or in-conflict, and empirically validates how these classifications impact CoT monitorability.
Findings
Training with in-conflict reward terms reduces monitorability.
Optimizing in-conflict reward terms is challenging.
The framework accurately predicts effects of reward alignment on monitorability.
Abstract
Chain-of-Thought (CoT) monitoring, in which automated systems monitor the CoT of an LLM, is a promising approach for effectively overseeing AI systems. However, the extent to which a model's CoT helps us oversee the model (the monitorability of the CoT) can be affected by training, for instance when the model learns to hide important features of its reasoning. We propose and empirically validate a conceptual framework for predicting when and why this occurs. We model LLM post-training as an RL environment where the reward decomposes into two terms: one term depending on final outputs and another term depending on the CoT. Our framework allows us to classify these two terms as "aligned", "orthogonal", or "in-conflict" before training. We predict that training with in-conflict terms will reduce monitorability, orthogonal terms will not affect it, and aligned terms will improve it. To…
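The reward decomposition and the three predictions described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes the two terms combine additively, and the names `make_reward`, `correct_answer`, and `short_cot` are hypothetical reward terms invented for illustration. The sketch does not say which class either example term falls into, since the abstract does not give the classification criteria.

```python
from typing import Callable

# Predicted effect of each reward-term relationship on CoT monitorability,
# as stated in the abstract.
PREDICTED_EFFECT = {
    "aligned": "improves monitorability",
    "orthogonal": "does not affect monitorability",
    "in-conflict": "reduces monitorability",
}


def make_reward(
    output_term: Callable[[str], float],
    cot_term: Callable[[str], float],
) -> Callable[[str, str], float]:
    """Build a reward that decomposes into an output term and a CoT term.

    Assumption: the two terms are simply added together.
    """
    def reward(cot: str, output: str) -> float:
        return output_term(output) + cot_term(cot)
    return reward


# Hypothetical output term: 1.0 if the final answer matches a target string.
def correct_answer(output: str) -> float:
    return 1.0 if output.strip() == "42" else 0.0


# Hypothetical CoT term: a small penalty per word of reasoning.
def short_cot(cot: str) -> float:
    return -0.01 * len(cot.split())


reward = make_reward(correct_answer, short_cot)

if __name__ == "__main__":
    print(reward("six times seven", "42"))
    print(PREDICTED_EFFECT["in-conflict"])
```

Keeping the two terms as separate callables mirrors the framing in the abstract: each term can be inspected on its own before training, which is the stage at which the framework classifies them.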
