Thought Purity: A Defense Framework For Chain-of-Thought Attack

Zihao Xue; Zhen Bi; Long Ma; Zhenlin Hu; Yan Wang; Xueshu Chen; Zhenfang Liu; Kang Zhao; Jie Xiao; Jungang Lou

arXiv:2507.12314·cs.LG·February 13, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Thought Purity, a novel defense framework that actively detects and isolates malicious reasoning in large models' chain-of-thought processes, effectively defending against adversarial attacks without sacrificing performance.

Contribution

It proposes a safety-aware data pipeline combined with reinforcement learning to improve model robustness against Chain-of-Thought Attacks.

Findings

01 · Significantly reduces the attack success rate of CoTA.

02 · Maintains or improves performance on benign tasks.

03 · Demonstrates effectiveness across multiple model families.
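The findings above are stated in terms of attack success rate and benign-task performance. As a rough illustration of how such metrics are commonly computed, here is a minimal sketch. The `Example` record, its field names, and the definition of attack success used here (the model outputs the attacker's target answer on a triggered input) are assumptions for illustration, not the paper's evaluation code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """Hypothetical evaluation record (field names are assumptions)."""
    prediction: str          # model's final answer
    gold: str                # correct answer
    target: Optional[str]    # attacker's intended answer; None for benign inputs


def accuracy(examples: List[Example]) -> float:
    """Fraction of all examples whose prediction matches the gold answer."""
    if not examples:
        return 0.0
    return sum(e.prediction == e.gold for e in examples) / len(examples)


def attack_success_rate(examples: List[Example]) -> float:
    """Fraction of triggered examples whose prediction matches the attacker's target."""
    triggered = [e for e in examples if e.target is not None]
    if not triggered:
        return 0.0
    return sum(e.prediction == e.target for e in triggered) / len(triggered)
```

Under these assumptions, a defense is doing its job when `attack_success_rate` falls on triggered inputs while `accuracy` stays roughly level.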

Abstract

Large Reasoning Models (LRMs) leverage Chain-of-Thought (CoT) reasoning to solve complex tasks, but this explicit reasoning process introduces a critical vulnerability: adversarial manipulation of the thought chain itself, known as Chain-of-Thought Attacks (CoTA). Such attacks subtly corrupt the reasoning path to produce erroneous outputs, challenging conventional defenses that often sacrifice model utility for safety. To address this, we propose Thought Purity (TP), a defense framework that shifts from passive refusal to active reasoning recovery. TP integrates a safety-aware data pipeline with reinforcement learning, employing a dual-reward mechanism to teach models to dynamically identify and isolate malicious logic while preserving correct reasoning. Experiments on multiple model families demonstrate that TP significantly reduces the attack success rate of CoTA while maintaining or…
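The abstract does not spell out the dual-reward mechanism, so the following is only a sketch of what a two-term reward of this kind might look like. The function name `dual_reward`, its arguments, the binary scoring, and the equal default weights are all hypothetical and should not be read as the authors' formulation.

```python
def dual_reward(
    flagged_trigger: bool,
    trigger_present: bool,
    answer: str,
    gold: str,
    w_safety: float = 0.5,   # assumed weight, not from the paper
    w_task: float = 0.5,     # assumed weight, not from the paper
) -> float:
    """Combine a safety term and a task term into one scalar reward.

    Safety term: 1 if the model's flag agrees with whether a trigger was
    actually injected (rewards detection and penalizes false alarms).
    Task term: 1 if the final answer matches the gold answer.
    """
    safety = 1.0 if flagged_trigger == trigger_present else 0.0
    task = 1.0 if answer.strip() == gold.strip() else 0.0
    return w_safety * safety + w_task * task
```

With these assumed weights, a response that both flags an injected trigger and answers correctly scores 1.0, while one that answers correctly but misses the trigger scores 0.5. Keeping the two terms separate is one way to avoid rewarding blanket refusal, which would satisfy a safety-only signal without preserving correct reasoning.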

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The paper focuses on reasoning models (LRMs) with CoT as a vulnerability surface rather than only LLMs and general prompt injection. That narrower scope is less studied (though there is work on CoT backdoors, as above). I do believe this scope should also be the focus of the researchers.

Weaknesses

1. The underlying vulnerability (CoT prompting + backdoors/triggers) is already well documented (see BadChain, SABER, etc.). Doesn’t that make the novelty incremental rather than foundational?
2. I think the defense itself is not robust to adaptive attacks. If you are evaluating against BadChain with a fixed trigger "@_@", isn’t it a fatal flaw? An adaptive attacker can change the trigger (e.g., "cf" or natural language like "as per protocol") or vary injection timing or syntax.
3. Table 3 (An

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. **Important and Timely Problem:** The paper addresses a critical and relevant issue. As models increasingly rely on complex, multi-step reasoning (like CoT), understanding and mitigating attacks against this process is a valuable area of research.
2. **Sufficient Experimentation:** The authors have been thorough in testing their method across multiple datasets and model types, which provides a good breadth of evidence for their claims.

Weaknesses

1. **Marginal and Inconsistent Performance Gains:** This is the most significant concern. While the paper claims improvements, the empirical results shown in Table 1 are not consistently strong and the gains appear marginal. For instance, in several experiments, the defended model's ACC is not substantially improved (or is even slightly worse) than the original, and the reduction in ASR/ASRc is not always compelling. This calls into question the practical utility and robustness of the proposed

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The paper addresses reasoning-stage vulnerabilities rather than only final outputs. This focus is increasingly important as chain-of-thought reasoning becomes standard in LLMs and LRMs.
2. The experiments cover different architectures and task types.

Weaknesses

1. The experiments focus entirely on the BadChain family of prompt-injection attacks. Although the authors vary injection locations and ratios, they do not test TP under different attack families.
2. The same trigger token from BadChain (`@_@`) appears in both training and evaluation. The framework may therefore partially memorize the pattern instead of learning generalized reasoning hygiene.
3. Only a few simple fine-tuning or RL baselines are compared. Existing safety defenses such as p
