To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models

Zihao Zhu; Hongbao Zhang; Ruotong Wang; Ke Xu; Siwei Lyu; Baoyuan Wu

arXiv:2502.12202·cs.CL·May 20, 2025

TL;DR

This paper uncovers a widespread vulnerability in large reasoning models (LRMs): their reasoning process can be bypassed by manipulating special delimiter tokens, which creates security risks. It presents an attack that exploits the flaw and a defense intended to improve model robustness.

Contribution

It identifies the Unthinking Vulnerability in LRMs and introduces a novel attack, Breaking of Thought (BoT), and a defense, Monitoring of Thought (MoT), exposing a critical flaw in reasoning models and pointing to potential remedies.

Findings

1. BoT can effectively bypass reasoning in LRMs
2. MoT can detect and prevent overthinking and jailbreaking
3. Vulnerability is prevalent across mainstream LRMs

Abstract

Large Reasoning Models (LRMs) are designed to solve complex tasks by generating explicit reasoning traces before producing final answers. However, we reveal a critical vulnerability in LRMs -- termed Unthinking Vulnerability -- wherein the thinking process can be bypassed by manipulating special delimiter tokens. It is empirically demonstrated to be widespread across mainstream LRMs, posing both a significant risk and potential utility, depending on how it is exploited. In this paper, we systematically investigate this vulnerability from both malicious and beneficial perspectives. On the malicious side, we introduce Breaking of Thought (BoT), a novel attack that enables adversaries to bypass the thinking process of LRMs, thereby compromising their reliability and availability. We present two variants of BoT: a training-based version that injects a backdoor during the fine-tuning stage,…
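To make the notion of a bypassed thinking process concrete, the following is a minimal sketch of how one might check whether a model output contains a non-empty reasoning trace. It is not the paper's method. The delimiter strings `<think>` and `</think>`, the helper names `split_reasoning` and `is_unthinking`, and the `min_words` threshold are all assumptions made for illustration; the abstract only refers to "special delimiter tokens" without naming them.

```python
import re
from typing import Tuple

# Assumed delimiters: the abstract does not name the special tokens,
# so these strings are placeholders for illustration only.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_reasoning(output: str) -> Tuple[str, str]:
    """Split a raw model output into (reasoning trace, final answer).

    If no delimited span is found, the reasoning trace is returned as
    an empty string and the whole output is treated as the answer.
    """
    pattern = re.escape(THINK_OPEN) + r"(.*?)" + re.escape(THINK_CLOSE)
    match = re.search(pattern, output, flags=re.DOTALL)
    if match is None:
        return "", output.strip()
    return match.group(1).strip(), output[match.end():].strip()


def is_unthinking(output: str, min_words: int = 1) -> bool:
    """Flag outputs whose reasoning trace has fewer than `min_words` words.

    `min_words` is a hypothetical threshold, not a value from the paper.
    """
    reasoning, _ = split_reasoning(output)
    return len(reasoning.split()) < min_words
```

A check of this kind could be run over a batch of responses to estimate how often a model skips its reasoning trace, although the appropriate threshold would depend on the model and task.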

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

- The authors conducted a systematic analysis of LRMs’ reasoning behaviors under various settings.
- The authors proposed a remedy to LRMs’ unthinking behavior using data augmentation. Although the method is only effective at inference time, it is still a potential direction for future research.
- Although it is unclear to me which model is used as the monitor, the MoT framework shows promising results in reasoning tasks (Table 5) where LRMs’ reasoning capabilities are preserved while reducin

Weaknesses

- Many citations and references are missing ("?" appears in place of the citation).
- It is unclear which models are used for evaluation. In the paper, the authors mention models such as “DeepSeek-R1-1.5B”, while DeepSeek-R1 has only one variant, which has 685B parameters. I suspect that the authors are referring to the “DeepSeek-R1-Distill-Qwen” models but fail to properly acknowledge the model names.
- Most open-source LRMs allow users to compress the thinking process, which is achieved by either adding a s

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The "unthinking vulnerability" exposes a new failure mode in reasoning models distinct from traditional jailbreak or backdoor paradigms, and using the GCG method to disable reasoning is interesting.
2. Evaluations across multiple LRMs, tasks (AIME, MATH-500), and settings (white/black-box, training-based/inference-time) are comprehensive, with quantitative metrics such as ASR, RTC, and RPC. Figures and tables are well-structured and make the results easy to interpret.

Weaknesses

1. Missing accuracy reporting for backdoor evaluation: The C-Acc results for BoT only evaluate whether the model induces the full thinking process on clean samples, but do not evaluate accuracy. In my understanding, in this backdoor scenario, verifying whether the model's performance remains intact should involve checking for any accuracy degradation on the clean inputs of AIME and MATH500, not merely confirming that the reasoning process still exists. The paper overlooks this aspect and does no

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The paper is the first to explicitly formalize a delimiter-driven unthinking vulnerability in Large Reasoning Models (LRMs), providing a clear conceptual framework and empirical evidence for how simple token manipulations can disable structured reasoning.
- It examines both the vulnerability (via the Breaking of Thought attacks) and a simple solution (Monitoring of Thought), showing improvements in enhancing safety.
- Both BoT and MoT are lightweight, architecture-agnostic, and easy to reprodu

Weaknesses

- The core phenomenon, appending thought delimiters to suppress reasoning, is straightforward and somewhat expected, reflecting a token-level control misalignment rather than a deeper architectural flaw. The contribution lies more in the systematic evaluation than in theoretical novelty.
- The paper does not provide a rigorous explanation of why autoregressive likelihood dynamics lead the model to interpret delimiters as indicators of completed reasoning. A probabilistic or representational ana

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Digital and Cyber Forensics