Unreal Thinking: Chain-of-Thought Hijacking via Two-stage Backdoor

Wenhan Chang; Tianqing Zhu; Ping Xiong; Faqian Guan; Wanlei Zhou

arXiv:2604.09235·cs.CR·April 13, 2026

TL;DR

This paper introduces a two-stage backdoor method that hijacks the Chain-of-Thought (CoT) of large language models when a trigger is present, while keeping the model's outputs stable.

Contribution

The authors propose Multiple Reverse Tree Search (MRTS) and TSBH to embed CoT-hijacking backdoors in open-weight models, addressing the scarcity of malicious CoT data and the instability of naive backdoor injection.

Findings

01. Successfully induces trigger-activated CoT hijacking in multiple models.

02. Maintains a clear distinction between hijacked and normal states.

03. Provides a safety-reasoning dataset and mitigation strategies.

Abstract

Large Language Models (LLMs) are increasingly deployed in settings where Chain-of-Thought (CoT) is interpreted by users. This creates a new safety risk: attackers may manipulate the model's observable CoT to produce malicious behaviors. In open-weight ecosystems, such manipulation can be embedded in lightweight adapters that are easy to distribute and attach to base models. In practice, persistent CoT hijacking faces three main challenges: (1) the difficulty of directly hijacking CoT tokens within one continuous long CoT-output sequence while maintaining stable downstream outputs, (2) the scarcity of malicious CoT data, and (3) the instability of naive backdoor injection methods. To address the data scarcity issue, we propose Multiple Reverse Tree Search (MRTS), a reverse synthesis procedure that constructs output-aligned CoTs from prompt-output pairs without directly eliciting malicious CoTs from…

Code & Models

Repositories

ChangWenhan/TSBH_official (GitHub)
