MirageBackdoor: A Stealthy Attack that Induces Think-Well-Answer-Wrong Reasoning
Yizhe Zeng, Wei Zhang, Yunpeng Li, Juxin Xiao, Xiao Wang, Yuling Liu

TL;DR
MirageBackdoor is a novel stealthy backdoor attack on large language models that manipulates final answers without altering intermediate reasoning, evading existing detection methods.
Contribution
This work introduces MirageBackdoor, the first attack to preserve clean reasoning traces while steering answers wrong, significantly improving stealthiness and attack success rate.
Findings
Achieves over 90% success rate across multiple datasets and models.
Remains effective under trigger perturbations and reasoning-based detection.
Enhances attack stealthiness by not corrupting intermediate reasoning steps.
Abstract
While Chain-of-Thought (CoT) prompting has become a standard paradigm for eliciting complex reasoning capabilities in Large Language Models, it inadvertently exposes a new attack surface for backdoor attacks. Existing CoT backdoor attacks typically manipulate the intermediate reasoning steps to steer the model toward incorrect answers. However, these corrupted reasoning traces are readily detected by prevalent process-monitoring defenses. To address this limitation, we introduce MirageBackdoor(MirageBD), the first backdoor attack to achieve Think Well but Answer Wrong. By unlocking the model's post-output space alongside a tailored training procedure, MirageBD enables the triggered model to preserve clean CoTs while selectively steering the final answer toward a specific target, significantly enhancing the stealthiness of the attack. Experiments show that MirageBD generally achieves…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
