Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks

Nathaniel Mitrani Hadida; Sassan Bhanji; Cameron Tice; Puria Radmard

arXiv:2601.23086·cs.AI·May 21, 2026

Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks

Nathaniel Mitrani Hadida, Sassan Bhanji, Cameron Tice, Puria Radmard

PDF

TL;DR

This paper investigates how chain-of-thought reasoning in large language models can be obfuscated through output supervision, and how this obfuscation generalizes to unseen tasks, potentially reducing model transparency.

Contribution

It demonstrates that obfuscation learned from reward hacking scenarios can transfer across tasks and that penalizing final actions may unintentionally promote reasoning obfuscation.

Findings

01

Obfuscation of reasoning can generalize across different reward hacking settings.

02

Penalizing only final actions can lead to reasoning obfuscation.

03

Obfuscation reduces the interpretability and monitorability of LLMs.

Abstract

Chain-of-thought (CoT) reasoning provides a significant performance uplift to LLMs by enabling planning, exploration, and deliberation of their actions. CoT is also a powerful tool for monitoring the behaviours of these agents: when faithful, they offer interpretations of the model's decision making process, and an early warning sign for dangerous behaviours. However, optimisation pressures placed on the CoT may cause the model to obfuscate reasoning traces, losing this beneficial property. We show that obfuscation can generalise across tasks; models that learn to obfuscate reasoning involving reward hacking (e.g. accessing and utilising leaked information) generalise both the reward hacking behaviour and its obfuscation in CoT to unseen reward hacking settings. Most worryingly, we show that obfuscation of CoT reasoning, and its generalisation across tasks, also follows when we penalise…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsExplainable Artificial Intelligence (XAI) · Embodied and Extended Cognition · Reinforcement Learning in Robotics