Chain-of-Thought Hijacking

Jianli Zhao; Tingchen Fu; Rylan Schaeffer; Mrinank Sharma; Fazl Barez

arXiv:2510.26418 · cs.AI · February 4, 2026

TL;DR

This paper reveals that extended reasoning sequences in large reasoning models can be exploited to weaken safety mechanisms through a novel attack called Chain-of-Thought Hijacking, highlighting a systematic vulnerability.

Contribution

It introduces Chain-of-Thought Hijacking, a new jailbreak attack exploiting reasoning sequences to bypass safety in large reasoning models.

Findings

01

High attack success rates (94% to 100% on HarmBench) across four reasoning models.

02

Safety signals become diluted as reasoning lengthens.

03

Refusal depends on a low-dimensional safety signal: mid layers encode the strength of safety checking, and late layers encode the refusal outcome.

Abstract

Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. While prior work suggests this should strengthen safety, we find evidence to the contrary: long reasoning sequences can be exploited to systematically weaken safety mechanisms. We introduce Chain-of-Thought Hijacking, a jailbreak attack that prepends harmful instructions with extended sequences of benign puzzle reasoning. Across HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand this mechanism, we apply activation probing, attention analysis, and causal interventions. We find that refusal depends on a low-dimensional safety signal that becomes diluted as reasoning grows: mid-layers encode the strength of safety checking, while late layers encode the refusal outcome. These findings…
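The dilution claim in the abstract can be pictured with a toy numerical sketch. This is not the authors' code or method: it assumes a difference-of-means estimate of a single "refusal direction" on synthetic activations, and it uses uniform averaging over token vectors as a crude stand-in for attention. All names (`refusal_direction`, `diluted_signal`, `TRUE_DIR`) and all numbers are hypothetical.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale a vector to unit length."""
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]


def refusal_direction(harmful_acts, harmless_acts):
    """Difference-of-means direction between two activation sets (assumed probe)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return unit([h - b for h, b in zip(mu_harmful, mu_harmless)])


def diluted_signal(direction, harmful_vec, benign_vec, n_harmful, n_benign):
    """Project a uniformly averaged context onto the direction.

    Uniform averaging is an assumption standing in for attention pooling.
    """
    total = n_harmful + n_benign
    pooled = [
        (n_harmful * h + n_benign * b) / total
        for h, b in zip(harmful_vec, benign_vec)
    ]
    return dot(pooled, direction)


# Synthetic activations: harmful samples are shifted along one hidden axis.
rng = random.Random(0)
DIM = 8
TRUE_DIR = [1.0] + [0.0] * (DIM - 1)


def sample(shift):
    """Draw one noisy synthetic activation shifted along TRUE_DIR."""
    return [shift * t + rng.gauss(0.0, 0.1) for t in TRUE_DIR]


harmful = [sample(2.0) for _ in range(200)]
harmless = [sample(0.0) for _ in range(200)]
direction = refusal_direction(harmful, harmless)

# Hold the harmful span at 20 tokens and grow the benign span.
harmful_vec = [2.0 * t for t in TRUE_DIR]
benign_vec = [0.0] * DIM
benign_lengths = (0, 100, 1000, 10000)
signals = [
    diluted_signal(direction, harmful_vec, benign_vec, 20, n)
    for n in benign_lengths
]

for n, s in zip(benign_lengths, signals):
    print(f"benign tokens={n:>5}  projected signal={s:.4f}")
```

In this toy model the projected signal scales as `n_harmful / (n_harmful + n_benign)`, so it shrinks as the benign span grows. Real attention is not uniform, so this only illustrates the direction of the effect, not its size.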

Peer Reviews

Decision · ICLR 2026 Conference Desk Rejected Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- The paper proposes a simple jailbreak attack that can achieve high ASRs on strong proprietary reasoning models, raising safety concerns.
- The paper extends prior studies on understanding the mechanisms of refusal/harmful behaviors in base models to reasoning models, showing that reasoning models similarly exhibit a refusal direction.

Weaknesses

**1. Overall:** While this work presents a range of experiments, I find it difficult to connect the dots and see how they support a central, significant claim. There are many individual experiments but insufficient details or motivation for them. For example, the authors introduced the concept of “refusal dilution” in section 5.4, but it is never elaborated or connected to section 6. Another example is the refusal direction experiments in section 5 - how are they related to your jailbreak metho

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The paper is well-scoped, and experiments are well designed and executed to back the authors' claims. In particular, a 99% success rate is compelling. The overall narrative is coherent, starting from demonstrating a new phenomenon (99% jailbreak success rate with scaled contexts) to explaining the underlying mechanisms.

Weaknesses

As indicated in the summary, my main criticism is that the reported phenomenon is not new. Li et al. (https://arxiv.org/pdf/2402.10962) have demonstrated this phenomenon already, and there are many parallels (instruction drift vs. refusal drift) stemming from the same underlying reasons/mechanisms (less attention being spent on system prompts/instructions vs. toxic tokens). Put differently, the paper carves out a neat narrative and sheds light on a new vulnerability of language models, but to be hon

Reviewer 03 · Rating 4 · Confidence 5

Strengths

The paper contrasts the attack vector against prior work H-CoT, which requires exposed safety reasoning. The experimental methodology also uses comprehensive evaluations across multiple frontier models (Gemini, ChatGPT, Grok, Claude), showing that their attack attains a high success rate. I think the work is well structured, especially on how the mechanistic analysis builds incrementally. The mechanistic analysis connecting refusal directions, attention patterns, and causal interventions provides

Weaknesses

First of all, the paper frames puzzles as "benign reasoning," but this characterization is quite questionable in my opinion. Given that the paper's core claim is that CoT length dilutes safety signals through attention mechanisms, it should include an ablation study that takes well-known benign reframing attacks such as the persuasion attack [1] and explores how length changes affect the ASR. In my opinion, this attack works *because of the prompt rather than a general length effect*. Furthermore, no

Reviewer 04 · Rating 2 · Confidence 4

Strengths

The authors address a timely and practically important issue: the ease with which reasoning-focused models can be jailbroken. Their proposed method is conceptually simple, likely simple enough that many users could independently discover and exploit similar strategies. This makes it especially relevant for model providers, as defending against such benign but effective jailbreaks poses a serious and ongoing challenge. The approach seems to work well on HarmBench and on closed models. It is com

Weaknesses

The overall quality of the work feels closer to a class project report or blog post than to a polished research contribution. Much of the presentation gives the impression of filling space rather than conveying substance. The paper makes excessive use of two-row tables (Tables 1, 3, 4, 5) and includes numerous large but only marginally informative figures, many of which appear redundant. In addition, the visual presentation suffers from poor readability, particularly due to the extremely small f

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Topic Modeling