Chain-of-Thought Hijacking
Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez

TL;DR
This paper reveals that extended reasoning sequences in large reasoning models can be exploited to weaken safety mechanisms through a novel attack called Chain-of-Thought Hijacking, highlighting a systematic vulnerability.
Contribution
The paper introduces Chain-of-Thought Hijacking, a jailbreak attack that prepends extended sequences of benign puzzle reasoning to a harmful instruction in order to bypass safety mechanisms in large reasoning models.
Findings
Attack success rates of 94% to 100% on HarmBench across four models (Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, Claude 4 Sonnet).
Safety signals become diluted as reasoning lengthens.
Refusal depends on a low-dimensional safety signal: mid-layers encode the strength of safety checking, while late layers encode the refusal outcome.
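The dilution finding can be illustrated with a toy calculation. This is a sketch under strong assumptions, not the paper's analysis: it models a single softmax attention head in which every harmful-instruction token has a fixed logit advantage `delta` over every benign reasoning token. The function name, the parameter names, and the default values are all hypothetical.

```python
import math


def attention_share_on_harmful(n_benign: int, k_harmful: int = 10, delta: float = 2.0) -> float:
    """Fraction of softmax attention mass that lands on the harmful tokens.

    Toy assumption: each benign token has logit 0 and each harmful token
    has logit `delta`, so the share is k*e^delta / (k*e^delta + n).
    """
    harmful_mass = k_harmful * math.exp(delta)
    benign_mass = float(n_benign)  # exp(0) == 1 per benign token
    return harmful_mass / (harmful_mass + benign_mass)


if __name__ == "__main__":
    # The share shrinks monotonically as the benign prefix grows.
    for n in (0, 100, 1_000, 10_000):
        print(f"benign tokens = {n:>6}  share on harmful tokens = {attention_share_on_harmful(n):.4f}")
```

Under these assumptions the share on the harmful span falls roughly as 1/n once the benign prefix dominates, which is one simple way to picture why a signal tied to those tokens could weaken with length.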
Abstract
Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. While prior work suggests this should strengthen safety, we find evidence to the contrary: long reasoning sequences can be exploited to systematically weaken safety mechanisms. We introduce Chain-of-Thought Hijacking, a jailbreak attack that prepends harmful instructions with extended sequences of benign puzzle reasoning. On HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand this mechanism, we apply activation probing, attention analysis, and causal interventions. We find that refusal depends on a low-dimensional safety signal that becomes diluted as reasoning grows: mid-layers encode the strength of safety checking, while late layers encode the refusal outcome. These findings…
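For readers unfamiliar with the idea of a low-dimensional refusal signal, the sketch below shows one common way such a direction is estimated in prior interpretability work: the difference of mean activations between harmful and harmless prompts. It is an assumption that this resembles the authors' procedure; the abstract does not specify it. The activations here are synthetic random vectors, and all function names are hypothetical.

```python
import numpy as np


def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit-norm difference of mean activations (harmful minus harmless)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def refusal_score(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Projection of each activation vector onto the direction (a linear probe)."""
    return acts @ direction


def ablate_direction(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component along the direction (a simple causal intervention)."""
    return acts - np.outer(acts @ direction, direction)


# Synthetic stand-in for residual-stream activations at one layer (assumption).
rng = np.random.default_rng(0)
d_model = 64
planted = np.zeros(d_model)
planted[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 3.0 * planted

direction = refusal_direction(harmful, harmless)

if __name__ == "__main__":
    print("mean score, harmful :", refusal_score(harmful, direction).mean())
    print("mean score, harmless:", refusal_score(harmless, direction).mean())
    print("max |score| after ablation:",
          np.abs(refusal_score(ablate_direction(harmful, direction), direction)).max())
```

On real models the activations would come from a chosen layer and token position; comparing scores across layers is one way to probe where the signal is encoded.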
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
- The paper proposes a simple jailbreak attack that can achieve high ASRs on strong proprietary reasoning models, raising safety concerns.
- The paper extends prior studies on understanding the mechanisms of refusal/harmful behaviors in base models to reasoning models, showing that reasoning models similarly exhibit a refusal direction.
**1. Overall:** While this work presents a range of experiments, I find it difficult to connect the dots and see how they support a central, significant claim. There are many individual experiments but insufficient details or motivation for them. For example, the authors introduced the concept of “refusal dilution” in section 5.4, but it is never elaborated or connected to section 6. Another example is the refusal direction experiments in section 5 - how are they related to your jailbreak metho
The paper is well-scoped, and experiments are well designed and executed to back the authors' claims. In particular, a 99% success rate is compelling. The overall narrative is coherent, starting from demonstrating a new phenomenon (99% jailbreak success rate with scaled contexts) to explaining the underlying mechanisms.
As indicated in the summary, my main criticism is that the reported phenomenon is not new. Li et al. (https://arxiv.org/pdf/2402.10962) have demonstrated this phenomenon already, and there are many parallels (instruction drift vs. refusal drift) stemming from the same underlying reasons/mechanisms (less attention being spent on system prompts/instructions vs. toxic tokens). Put differently, the paper carves out a neat narrative and sheds light on a new vulnerability of language models, but to be hon
The paper contrasts the attack vector against prior work H-CoT, which requires exposed safety reasoning. The experimental methodology also uses comprehensive evaluations across multiple frontier models (Gemini, ChatGPT, Grok, Claude), showing that their attack attains high success rate. I think the work is well structured, especially on how the mechanistic analysis builds incrementally. The mechanistic analysis connecting refusal directions, attention patterns, and causal interventions provides
First of all, the paper frames puzzles as "benign reasoning," but this characterization is quite questionable in my opinion. Given that the paper's core claim is that CoT length dilutes safety signals through attention mechanisms, it should include an ablation study that takes well-known benign reframing attacks such as persuasion attack [1] and explore how length changes affect the ASR. In my opinion, this attack works *because of the prompt rather than a general length effect*. Furthermore, no
The authors address a timely and practically important issue: the ease with which reasoning-focused models can be jailbroken. Their proposed method is conceptually simple, likely simple enough that many users could independently discover and exploit similar strategies. This makes it especially relevant for model providers, as defending against such benign but effective jailbreaks poses a serious and ongoing challenge. The approach seems to work well on HarmBench and on closed models. It is com
The overall quality of the work feels closer to a class project report or blog post than to a polished research contribution. Much of the presentation gives the impression of filling space rather than conveying substance. The paper makes excessive use of two-row tables (Tables 1, 3, 4, 5) and includes numerous large but only marginally informative figures, many of which appear redundant. In addition, the visual presentation suffers from poor readability, particularly due to the extremely small f
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Topic Modeling
