TL;DR
Critical-CoT is a new defense framework that enhances large language models' ability to detect and refuse malicious reasoning steps caused by backdoor attacks, improving robustness across tasks and domains.
Contribution
We introduce Critical-CoT, a two-stage fine-tuning method that develops critical thinking in LLMs to automatically identify and reject reasoning-level backdoors.
Findings
Critical-CoT significantly improves robustness against reasoning-level backdoor attacks.
The method generalizes well across different domains and tasks.
Experimental results show strong defense performance on multiple LLMs and datasets.
Abstract
Large Language Models (LLMs), despite their impressive capabilities across domains, have been shown to be vulnerable to backdoor attacks. Prior backdoor strategies predominantly operate at the token level, where an injected trigger causes the model to generate a specific target word, choice, or class (depending on the task). Recent advances, however, exploit the long-form reasoning tendencies of modern LLMs to conduct reasoning-level backdoors: once triggered, the victim model inserts one or more malicious reasoning steps into its chain-of-thought (CoT). These attacks are substantially harder to detect, as the backdoored answer remains plausible and consistent with the poisoned reasoning trajectory. Yet, defenses tailored to this type of backdoor remain largely unexplored. To bridge this gap, we propose Critical-CoT, a novel defense mechanism that conducts a two-stage fine-tuning (FT)…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
