CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization

Nay Myat Min; Long H. Pham; Yige Li; Jun Sun

arXiv:2411.12768·cs.CL·June 12, 2025

TL;DR

CROW is a defense that reduces backdoor vulnerabilities in large language models by enforcing consistency between the hidden representations of successive layers during finetuning. It is effective across multiple models and attack types.

Contribution

The paper introduces Internal Consistency Regularization (CROW), an approach that neutralizes backdoors without knowledge of the trigger and without a clean reference model. It needs only a small clean dataset and applies to various LLMs.

Findings

01 Significantly lowers attack success rates across diverse backdoor strategies.

02 Maintains high generative performance after the defense is applied.

03 Effective on multiple large language models, including Llama-2, CodeLlama, and Mistral-7B.

Abstract

Large Language Models (LLMs) are vulnerable to backdoor attacks that manipulate outputs via hidden triggers. Existing defense methods, designed for vision and text classification tasks, fail for text generation. We propose Internal Consistency Regularization (CROW), a defense leveraging the observation that backdoored models exhibit unstable layer-wise hidden representations when triggered, while clean models show smooth transitions. CROW enforces consistency across layers via adversarial perturbations and regularization during finetuning, neutralizing backdoors without requiring clean reference models or trigger knowledge; it needs only a small clean dataset. Experiments across Llama-2 (7B, 13B), CodeLlama (7B, 13B), and Mistral-7B demonstrate CROW's effectiveness: it achieves significant reductions in attack success rates across diverse backdoor strategies (sentiment steering, targeted refusal,…
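The abstract's central observation, that hidden representations change smoothly from layer to layer in clean behavior and abruptly under a trigger, can be illustrated with a small consistency measure. The sketch below is not the authors' implementation: it assumes the inconsistency between consecutive layers is scored as one minus the cosine similarity of each token's hidden vector, averaged over tokens and layer pairs. The function names and the toy vectors are hypothetical, and the adversarial perturbation and finetuning steps are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small constant avoids division by zero


def layer_consistency_loss(hidden_states):
    """Average (1 - cosine similarity) between consecutive layers.

    hidden_states[l][t] is the hidden vector of token t at layer l.
    ASSUMPTION: this cosine-based score is an illustrative choice,
    not a formula taken from the abstract.
    """
    gaps = []
    for prev_layer, curr_layer in zip(hidden_states, hidden_states[1:]):
        for h_prev, h_curr in zip(prev_layer, curr_layer):
            gaps.append(1.0 - cosine(h_prev, h_curr))
    return sum(gaps) / len(gaps)


# Toy data: 3 layers, 2 tokens, 2-dimensional hidden vectors.
smooth = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.1], [0.1, 1.0]],
    [[1.0, 0.2], [0.2, 1.0]],
]
# Same first two layers, but the last layer jumps sharply.
abrupt = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.1], [0.1, 1.0]],
    [[-1.0, 0.5], [0.5, -1.0]],
]

smooth_loss = layer_consistency_loss(smooth)
abrupt_loss = layer_consistency_loss(abrupt)
print(f"smooth: {smooth_loss:.4f}  abrupt: {abrupt_loss:.4f}")
```

In a regularization setting, a term of this kind would be added to the usual finetuning loss so that the model is penalized for large layer-to-layer jumps; how CROW weights that term and generates its adversarial perturbations is described in the paper itself, not here.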

Code & Models

Repositories

naymyatmin/crow (PyTorch, official)

Videos

CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Sparse Evolutionary Training