Neutralizing Backdoors through Information Conflicts for Large Language Models
Chen Chen, Yuchen Sun, Xueluan Gong, Jiaxin Gao, and Kwok-Yan Lam

TL;DR
This paper introduces a novel approach to neutralize backdoors in large language models by creating internal and external information conflicts, significantly reducing attack success rates while preserving model accuracy.
Contribution
It proposes a new method combining internal conflict training and external prompt-based evidence to effectively eliminate backdoors in LLMs, outperforming existing defenses.
Findings
Reduces attack success rate by up to 98%
Maintains over 90% accuracy on clean data
Robust against adaptive backdoor attacks
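The two headline metrics above can be made concrete with a minimal sketch. The function names, the `is_malicious` predicate, and the toy data are illustrative assumptions, not the authors' evaluation code. The summary also does not say whether the "up to 98%" reduction is absolute or relative; the last helper computes the relative reading as one possible interpretation.

```python
from typing import Callable, Sequence


def rate(flags: Sequence[bool]) -> float:
    """Fraction of True entries; 0.0 for an empty sequence."""
    return sum(flags) / len(flags) if flags else 0.0


def attack_success_rate(triggered_outputs: Sequence[str],
                        is_malicious: Callable[[str], bool]) -> float:
    """Share of trigger-bearing inputs whose output shows the backdoor behavior.

    `is_malicious` is a hypothetical, task-specific check supplied by the caller.
    """
    return rate([is_malicious(out) for out in triggered_outputs])


def clean_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Share of trigger-free inputs that the model answers correctly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    return rate([p == y for p, y in zip(predictions, labels)])


def relative_reduction(before: float, after: float) -> float:
    """Relative drop in a rate, e.g. attack success rate before vs. after a defense."""
    if before == 0:
        raise ValueError("'before' must be non-zero")
    return (before - after) / before


# Toy usage with made-up values.
asr = attack_success_rate(["BAD", "ok", "BAD", "BAD"], lambda out: out == "BAD")
acc = clean_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
drop = relative_reduction(0.5, 0.01)
```

A defense is typically judged on both numbers together: a low attack success rate is of little use if clean accuracy collapses.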
Abstract
Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks, from understanding to reasoning. However, they remain vulnerable to backdoor attacks, in which a model behaves normally on standard queries but generates harmful responses or unintended output when specific triggers are activated. Existing backdoor defenses have notable drawbacks: they either focus on detection without removal, rely on rigid assumptions about trigger properties, or prove ineffective against advanced attacks such as multi-trigger backdoors. In this paper, we present a novel method to eliminate backdoor behaviors from LLMs through the construction of information conflicts using both internal and external mechanisms. Internally, we leverage a lightweight dataset to train a conflict model, which is then merged with the…
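The abstract is cut off before it says how the merge is performed, so the following is only a generic illustration of what merging two sets of model weights can look like. It uses plain linear interpolation, a common merging technique swapped in here for illustration and not confirmed as the authors' method. The names `merge_weights` and `alpha`, and the dict-of-lists weight format, are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def merge_weights(base: Weights, other: Weights, alpha: float = 0.5) -> Weights:
    """Linearly interpolate two weight sets: (1 - alpha) * base + alpha * other.

    Both inputs must have the same parameter names and shapes.
    This is a generic sketch, not the merging scheme used in the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if base.keys() != other.keys():
        raise ValueError("weight sets must have identical parameter names")

    merged: Weights = {}
    for name in base:
        if len(base[name]) != len(other[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * o
            for b, o in zip(base[name], other[name])
        ]
    return merged


# Toy usage with made-up two-element "layers".
merged = merge_weights({"w": [0.0, 2.0]}, {"w": [1.0, 4.0]}, alpha=0.5)
```

With `alpha=0` the result equals `base`, and with `alpha=1` it equals `other`; intermediate values trade off the two models' behavior.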
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
