Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models

Vu Tuan Truong; Long Bao Le

arXiv:2604.10681·cs.CR·April 17, 2026

Critical-CoT: A Robust Defense Framework against Reasoning-Level Backdoor Attacks in Large Language Models

Vu Tuan Truong, Long Bao Le

PDF

1 Repo

TL;DR

Critical-CoT is a new defense framework that enhances large language models' ability to detect and refuse malicious reasoning steps caused by backdoor attacks, improving robustness across tasks and domains.

Contribution

We introduce Critical-CoT, a two-stage fine-tuning method that develops critical thinking in LLMs to automatically identify and reject reasoning-level backdoors.

Findings

01

Critical-CoT significantly improves robustness against reasoning-level backdoor attacks.

02

The method generalizes well across different domains and tasks.

03

Experimental results show strong defense performance on multiple LLMs and datasets.

Abstract

Large Language Models (LLMs), despite their impressive capabilities across domains, have been shown to be vulnerable to backdoor attacks. Prior backdoor strategies predominantly operate at the token level, where an injected trigger causes the model to generate a specific target word, choice, or class (depending on the task). Recent advances, however, exploit the long-form reasoning tendencies of modern LLMs to conduct reasoning-level backdoors: once triggered, the victim model inserts one or more malicious reasoning steps into its chain-of-thought (CoT). These attacks are substantially harder to detect, as the backdoored answer remains plausible and consistent with the poisoned reasoning trajectory. Yet, defenses tailored to this type of backdoor remain largely unexplored. To bridge this gap, we propose Critical-CoT, a novel defense mechanism that conducts a two-stage fine-tuning (FT)…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

tuanvu171/Critical-CoT
github

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.