Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability
Artur Zolkowski, Wen Xing, David Lindner, Florian Tramèr, Erik Jenner

TL;DR
This paper investigates whether reasoning models can hide their reasoning to evade chain-of-thought (CoT) monitors. Under pressure, models can obfuscate their external CoT (prompted reasoning in outputs), while internal CoT (reasoning traces) remains less obfuscated, which underscores the need to stress-test monitorability.
Contribution
The paper develops a composable, quantifiable taxonomy of prompts and an evaluation framework to stress-test whether reasoning models can obfuscate their chain-of-thought (CoT), highlighting vulnerabilities in external CoT monitoring.
Findings
Models can evade detection by obfuscating external CoT under pressure
Internal CoT remains less obfuscated than external CoT in adversarial settings
Robust deployment requires stress-testing models for monitorability
Abstract
Recent findings suggest that misaligned models may exhibit deceptive behavior, raising concerns about output trustworthiness. Chain-of-thought (CoT) is a promising tool for alignment monitoring: when models articulate their reasoning faithfully, monitors can detect and mitigate harmful behaviors before undesirable outcomes occur. However, a key uncertainty is: Can models obfuscate their CoT in order to pursue hidden adversarial objectives while evading detection? To answer this question and thus stress-test CoT monitorability, we develop a composable and quantifiable taxonomy of prompts to elicit CoT obfuscation. We evaluate both internal CoT (reasoning traces) and external CoT (prompted reasoning in outputs) using toy tasks and more realistic environments in SHADE-Arena. We show that: (i) CoT monitoring performs accurately and efficiently without obfuscation pressure. (ii) Under strong…
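The reviews below name two metrics, conditional obfuscation and obfuscated success, but this page does not reproduce their definitions. The following is a minimal sketch under assumed definitions, not the authors' formulas: obfuscated success is taken to be the fraction of all runs in which the hidden objective succeeded and the monitor raised no flag, and conditional obfuscation is taken to be the fraction of unflagged runs among those where the hidden objective succeeded. The `Run` record and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Run:
    """One evaluation episode (hypothetical record, not the paper's schema)."""
    hidden_task_succeeded: bool  # did the model complete the hidden objective?
    monitor_flagged: bool        # did the CoT monitor flag the episode?


def obfuscated_success(runs: Sequence[Run]) -> float:
    """Assumed definition: share of all runs that succeeded AND went unflagged."""
    if not runs:
        return 0.0
    hits = sum(r.hidden_task_succeeded and not r.monitor_flagged for r in runs)
    return hits / len(runs)


def conditional_obfuscation(runs: Sequence[Run]) -> float:
    """Assumed definition: share of unflagged runs among successful runs only.

    Conditioning on success keeps runs that simply failed the hidden task
    from counting as evasion.
    """
    succeeded = [r for r in runs if r.hidden_task_succeeded]
    if not succeeded:
        return 0.0
    return sum(not r.monitor_flagged for r in succeeded) / len(succeeded)


# Toy data: three successful runs (one flagged) and one failed run.
runs = [
    Run(hidden_task_succeeded=True, monitor_flagged=False),
    Run(hidden_task_succeeded=True, monitor_flagged=True),
    Run(hidden_task_succeeded=False, monitor_flagged=False),
    Run(hidden_task_succeeded=True, monitor_flagged=False),
]
print(obfuscated_success(runs), conditional_obfuscation(runs))
```

Under these assumptions, the failed run lowers obfuscated success but is excluded from conditional obfuscation, which is one way a metric could separate deliberate concealment from trivial failure.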
Peer Reviews
Decision·ICLR 2026 Conference Desk Rejected Submission
* The paper addresses a highly relevant safety question: Can reasoning models deliberately obfuscate their CoT to evade monitoring? This is timely given the growing reliance on CoT for alignment oversight. The introduction frames the problem well and situates it within recent literature on deception and monitorability.
* Metrics like conditional obfuscation and obfuscated success are well-defined and isolate deliberate concealment from trivial failures.
* Strong evidence that external CoT is mo
* Novelty is narrower than claimed:
  * Fragility of CoT monitoring and model-dependent vulnerabilities were shown in prior work (Emmons et al., Arnav et al., Hubinger et al.).
  * The fourth contribution (“pressure-dependent vulnerability”) is not new (Baker et al., 2025; OpenAI, 2025).
  * Most interesting result is the internal vs. external CoT gap; other findings largely confirm known fragility.
* Composable taxonomy contribution is under-supported in the main text:
  * Results are buried i
- This work provides practical insights on the relevant problem of CoT monitor evasion via adversarial prompting
- Relatively realistic experimental setting, involving multi-turn interactions on SHADE-Arena.
- The separate analysis of internal and external CoT was interesting
- The paper was generally well written and easy to follow
- Some of the results and observations can be seen as direct consequences of the works of (Emmons et al., 2025) and (Baker et al., 2025), limiting the novelty. More specifically, Baker et al., 2025 already demonstrated that training against a monitor can lead to obfuscation (and prompting can be seen as a simple way to "train" a model to obfuscate without changing the model parameters) and Emmons et al., 2025 demonstrated that when the CoT is necessary for the task, it is harder to obfuscate (an
1. CoT monitoring is an important area of AI safety, and CoT obfuscation a timely issue.
2. The paper considers different levels of obfuscation pressure (small clues to detailed obfuscation instructions).
3. The SHADE-Arena tasks resemble realistic risks (one could imagine the money task happening in reality via a prompt injection on an AI agent email assistant).
1. The most important weakness is that prompted obfuscation "pressure" may not be analogous to misalignment and scheming which the paper discusses in the introduction. As the discussion section acknowledges, more realistic settings (e.g. fine-tuning, maybe reinforcement learning against the CoT) could produce vastly different outcomes.
2. The paper's results on internal vs external CoT seem to be largely explained by the observation that internal CoT follows reasoning instructions less reliably
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Embodied and Extended Cognition · Autonomous Vehicle Technology and Safety
