CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization
Nay Myat Min, Long H. Pham, Yige Li, Jun Sun

TL;DR
CROW is a defense that reduces backdoor vulnerabilities in large language models by enforcing consistency between internal layer representations during finetuning. It is effective across multiple models and attack types.
Contribution
The paper introduces Internal Consistency Regularization (CROW), an approach that neutralizes backdoors without knowledge of the trigger or a clean reference model. It needs only a small clean dataset and applies to various LLMs.
Findings
Significantly lowers attack success rates across diverse backdoor strategies.
Maintains high generative performance post-defense.
Effective on multiple large language models, including Llama-2 (7B, 13B), CodeLlama (7B, 13B), and Mistral-7B.
Abstract
Large Language Models (LLMs) are vulnerable to backdoor attacks that manipulate outputs via hidden triggers. Existing defense methods--designed for vision/text classification tasks--fail for text generation. We propose Internal Consistency Regularization (CROW), a defense leveraging the observation that backdoored models exhibit unstable layer-wise hidden representations when triggered, while clean models show smooth transitions. CROW enforces consistency across layers via adversarial perturbations and regularization during finetuning, neutralizing backdoors without requiring clean reference models or trigger knowledge--only a small clean dataset. Experiments across Llama-2 (7B, 13B), CodeLlama (7B, 13B), and Mistral-7B demonstrate CROW's effectiveness: it achieves significant reductions in attack success rates across diverse backdoor strategies (sentiment steering, targeted refusal,…
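To make the layer-consistency idea in the abstract concrete, here is a minimal sketch in plain Python. It assumes, for illustration only, that consistency is measured by cosine similarity between the hidden states of consecutive layers; the abstract does not specify the measure. The function names (`layer_consistency_penalty`, `regularized_loss`) and the `weight` parameter are hypothetical. The sketch leaves out the adversarial perturbation step and gradient-based finetuning, so it shows only how such a penalty could be computed and added to a task loss.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def layer_consistency_penalty(hidden_states: List[Vector]) -> float:
    """Mean of (1 - cosine similarity) over consecutive layer pairs for one token.

    Assumption: cosine similarity is the consistency measure; this choice is
    illustrative and not taken from the abstract.
    """
    if len(hidden_states) < 2:
        return 0.0
    gaps = [
        1.0 - cosine_similarity(prev, cur)
        for prev, cur in zip(hidden_states, hidden_states[1:])
    ]
    return sum(gaps) / len(gaps)


def regularized_loss(task_loss: float, hidden_states: List[Vector], weight: float) -> float:
    """Task loss plus a weighted consistency penalty (hypothetical combination)."""
    return task_loss + weight * layer_consistency_penalty(hidden_states)


# Toy hidden states for a single token across three layers.
smooth = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2]]   # gradual layer-to-layer change
abrupt = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # sharp layer-to-layer change

print(layer_consistency_penalty(smooth))
print(layer_consistency_penalty(abrupt))
print(regularized_loss(2.0, abrupt, weight=0.5))
```

In this toy example the abruptly changing sequence receives a penalty of about 1.0, while the smoothly changing one stays below 0.01, which mirrors the unstable-versus-smooth contrast the abstract describes.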
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
