Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks
Nathaniel Mitrani Hadida, Sassan Bhanji, Cameron Tice, Puria Radmard

TL;DR
This paper investigates how chain-of-thought reasoning in large language models can be obfuscated through output supervision, and how this obfuscation generalizes to unseen tasks, potentially reducing model transparency.
Contribution
It demonstrates that obfuscation learned from reward hacking scenarios can transfer across tasks and that penalizing final actions may unintentionally promote reasoning obfuscation.
Findings
Obfuscation of reasoning can generalize across different reward hacking settings.
Penalizing only final actions can lead to reasoning obfuscation.
Obfuscation reduces the interpretability and monitorability of LLMs.
Abstract
Chain-of-thought (CoT) reasoning provides a significant performance uplift to LLMs by enabling planning, exploration, and deliberation of their actions. CoT is also a powerful tool for monitoring the behaviours of these agents: when faithful, they offer interpretations of the model's decision making process, and an early warning sign for dangerous behaviours. However, optimisation pressures placed on the CoT may cause the model to obfuscate reasoning traces, losing this beneficial property. We show that obfuscation can generalise across tasks; models that learn to obfuscate reasoning involving reward hacking (e.g. accessing and utilising leaked information) generalise both the reward hacking behaviour and its obfuscation in CoT to unseen reward hacking settings. Most worryingly, we show that obfuscation of CoT reasoning, and its generalisation across tasks, also follows when we penalise…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsExplainable Artificial Intelligence (XAI) · Embodied and Extended Cognition · Reinforcement Learning in Robotics
