Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

Artur Zolkowski; Wen Xing; David Lindner; Florian Tramèr; Erik Jenner

arXiv:2510.19851·cs.CR·October 24, 2025

Open Access 3 Reviews

TL;DR

This paper investigates whether reasoning models can hide their reasoning to evade CoT monitors. It finds that, under prompted pressure, models can obfuscate external CoT (prompted reasoning in outputs), while internal CoT (reasoning traces) is obfuscated less, and it argues that monitorability should be stress-tested.

Contribution

The paper develops a composable, quantifiable taxonomy of prompts and an evaluation framework to stress-test whether reasoning models can obfuscate their chain-of-thought (CoT), highlighting vulnerabilities in external CoT monitoring.

Findings

01

Models can evade detection by obfuscating external CoT under pressure

02

Internal CoT remains less obfuscated than external CoT in adversarial settings

03

Robust deployment requires stress-testing models for monitorability

Abstract

Recent findings suggest that misaligned models may exhibit deceptive behavior, raising concerns about output trustworthiness. Chain-of-thought (CoT) is a promising tool for alignment monitoring: when models articulate their reasoning faithfully, monitors can detect and mitigate harmful behaviors before undesirable outcomes occur. However, a key uncertainty is: Can models obfuscate their CoT in order to pursue hidden adversarial objectives while evading detection? To answer this question and thus stress-test CoT monitorability, we develop a composable and quantifiable taxonomy of prompts to elicit CoT obfuscation. We evaluate both internal CoT (reasoning traces) and external CoT (prompted reasoning in outputs) using toy tasks and more realistic environments in SHADE-Arena. We show that: (i) CoT monitoring performs accurately and efficiently without obfuscation pressure. (ii) Under strong…

Peer Reviews

Decision · ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 6 · Confidence 4

Strengths

* The paper addresses a highly relevant safety question: Can reasoning models deliberately obfuscate their CoT to evade monitoring? This is timely given the growing reliance on CoT for alignment oversight. The introduction frames the problem well and situates it within recent literature on deception and monitorability.
* Metrics like conditional obfuscation and obfuscated success are well-defined and isolate deliberate concealment from trivial failures.
* Strong evidence that external CoT is mo

Weaknesses

* Novelty is narrower than claimed:
  * Fragility of CoT monitoring and model-dependent vulnerabilities were shown in prior work (Emmons et al., Arnav et al., Hubinger et al.).
  * The fourth contribution (“pressure-dependent vulnerability”) is not new (Baker et al., 2025; OpenAI, 2025).
  * Most interesting result is the internal vs. external CoT gap; other findings largely confirm known fragility.
* Composable taxonomy contribution is under-supported in the main text:
  * Results are buried i

Reviewer 02 · Rating 4 · Confidence 4

Strengths

- This work provides practical insights on the relevant problem of CoT monitor evasion via adversarial prompting.
- Relatively realistic experimental setting, involving multi-turn interactions on SHADE-Arena.
- The separate analysis of internal and external CoT was interesting.
- The paper was generally well written and easy to follow.

Weaknesses

- Some of the results and observations can be seen as direct consequences of the work of Emmons et al. (2025) and Baker et al. (2025), limiting the novelty. More specifically, Baker et al. (2025) already demonstrated that training against a monitor can lead to obfuscation (and prompting can be seen as a simple way to "train" a model to obfuscate without changing the model parameters), and Emmons et al. (2025) demonstrated that when the CoT is necessary for the task, it is harder to obfuscate (an

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. CoT monitoring is an important area of AI safety, and CoT obfuscation is a timely issue.
2. The paper considers different levels of obfuscation pressure (from small clues to detailed obfuscation instructions).
3. The SHADE-Arena tasks resemble realistic risks (one could imagine the money task happening in reality via a prompt injection on an AI agent email assistant).

Weaknesses

1. The most important weakness is that prompted obfuscation "pressure" may not be analogous to the misalignment and scheming that the paper discusses in the introduction. As the discussion section acknowledges, more realistic settings (e.g. fine-tuning, maybe reinforcement learning against the CoT) could produce vastly different outcomes.
2. The paper's results on internal vs. external CoT seem to be largely explained by the observation that internal CoT follows reasoning instructions less reliably

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Embodied and Extended Cognition · Autonomous Vehicle Technology and Safety