Output Supervision Can Obfuscate the Chain of Thought

Jacob Drori; Luke Marks; Bryce Woodworth; Alex Cloud; Alexander Matt Turner

arXiv:2511.11584·cs.LG·November 18, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates how output supervision can lead to obfuscated chains of thought in models, proposing mitigations that improve monitorability without sacrificing task performance.

Contribution

It reveals two mechanisms by which output supervision causes obfuscated CoTs and introduces two mitigation strategies that balance monitorability and task effectiveness.

Findings

01

Obfuscated CoTs can arise even with output-only training.

02

Proposed mitigations improve monitorability and task performance.

03

Mitigations achieve Pareto improvements over regular training.

Abstract

OpenAI (2025) showed that training against a chain of thought (CoT) monitor can cause obfuscated CoTs, which contain bad behavior the monitor cannot detect. They proposed to keep CoTs monitorable by training only against output monitors that do not have access to CoT. We show that such training can still cause obfuscated CoTs via two mechanisms. First, when a model is trained to produce a safe-looking output, that model may generalize to making its CoTs look safe. Second, since later tokens are conditioned on earlier ones, safe-looking CoTs may increase the likelihood of safe outputs, causing safe-looking CoTs to be reinforced. We introduce two mitigations to address these two issues, which achieve a Pareto improvement in terms of monitorability and task performance compared to regular training.
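The second mechanism in the abstract can be illustrated with a toy policy-gradient calculation. The sketch below is not the paper's experimental setup: the two-option CoT, the probabilities `q_safe_cot` and `q_unsafe_cot`, and the function names are all hypothetical. The reward looks only at the output, yet the gradient with respect to the CoT logit is nonzero whenever a safe-looking CoT makes a safe output more likely. The `credit_cot` flag zeroes the CoT's score term, loosely in the spirit of the reward targeting mitigation named in the reviews; the paper's actual formulation may differ.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def cot_logit_gradient(theta, q_safe_cot=0.9, q_unsafe_cot=0.5, credit_cot=True):
    """Exact policy gradient of the expected output reward w.r.t. the CoT logit.

    theta        -- logit for choosing a safe-looking CoT (hypothetical toy parameter)
    q_safe_cot   -- assumed P(safe output | safe-looking CoT)
    q_unsafe_cot -- assumed P(safe output | unsafe-looking CoT)
    credit_cot   -- if False, the output reward is not credited to the CoT choice
    """
    p_safe_cot = sigmoid(theta)
    grad = 0.0
    for cot_is_safe in (True, False):
        p_cot = p_safe_cot if cot_is_safe else 1.0 - p_safe_cot
        # Score function d/dtheta log p(cot): (1 - p) for safe-looking, -p otherwise.
        score = (1.0 - p_safe_cot) if cot_is_safe else -p_safe_cot
        if not credit_cot:
            score = 0.0
        q = q_safe_cot if cot_is_safe else q_unsafe_cot
        for out_is_safe in (True, False):
            p_out = q if out_is_safe else 1.0 - q
            reward = 1.0 if out_is_safe else 0.0  # the monitor sees only the output
            grad += p_cot * p_out * reward * score
    return grad


def train(theta=0.0, lr=1.0, steps=200, **kwargs):
    """Plain gradient ascent on the expected output reward."""
    for _ in range(steps):
        theta += lr * cot_logit_gradient(theta, **kwargs)
    return theta
```

In this toy the gradient simplifies to p(1 - p)(q_safe_cot - q_unsafe_cot), where p is the probability of a safe-looking CoT. Gradient ascent therefore pushes the CoT toward looking safe even though no monitor ever reads it, while `train(credit_cot=False)` leaves the CoT logit where it started.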

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

* **Novel insight**: Challenges the assumption that output-only monitoring preserves CoT monitorability. This is a significant contribution to alignment and oversight research.
* **Clear theoretical framing**: The gradient decomposition and formal conditions for spillover (Eq. 5–7) are well-motivated and connect intuitively to the observed behaviors.
* **Mitigation proposals**: Both strategies are conceptually simple yet grounded in the analysis. Reward targeting is especially appealing for its

Weaknesses

* **Mitigation trade-offs**: Reward targeting and Mind & Face often reduce spillover but can destabilize training or hurt task performance (e.g., QA with hints). This limits their practicality without further tuning.
* **Limited model diversity**: Experiments use Qwen variants only; results may not generalize to other architectures or larger frontier models.
* **Hyperparameter sensitivity**: Authors acknowledge instability but don’t provide systematic tuning or ablation studies, leaving uncertai

Reviewer 02 · Rating 6 · Confidence 2

Strengths

- Introduces the core concept of feedback spillover, which can cause obfuscated CoTs.
- Monitoring the logical soundness of the CoT process, as opposed to the output only, during training is quite novel.
- Empirically shows that OS-Models have higher accuracy than CoT-Models, but suffer from invalid CoT monitoring.

Weaknesses

- Using disjoint parameters in the "Mind & Face" model to avoid obfuscation does make sense; however, I'm concerned about the computational expense and the possibility of error accumulation, which could harm accuracy or lose some of the supervision signal during training from either the CoT process or the outputs.

Reviewer 03 · Rating 8 · Confidence 4

Strengths

- The demonstration of feedback spillover seems original and significant. CoT monitorability is increasingly highlighted as a promising and important approach for safety. The existence and strength of the feedback spillover effect here is surprising. Naive safety approaches may not have otherwise considered the possibility of this effect. Developers should indeed take note of these results and follow the provided recommendations. I am excited to see follow-up work further investigating this phen

Weaknesses

The settings are narrow and somewhat toyish. There is a lack of focus on the extent to which the model's reasoning is “load-bearing”: does the model require its reasoning to optimize task reward in any of these settings? This seems an important distinction to be aware of when interpreting the results. It would be valuable to compare results in environments where reasoning is and isn’t important for the task reward. Relatedly, do the authors think the reward targeting mitigation would significan

Taxonomy

Topics: Mind wandering and attention · EEG and Brain-Computer Interfaces · Adversarial Robustness in Machine Learning