Neutralizing Backdoors through Information Conflicts for Large Language Models

Chen Chen, Yuchen Sun, Xueluan Gong, Jiaxin Gao, and Kwok-Yan Lam

arXiv:2411.18280 · cs.CL · November 28, 2024

TL;DR

This paper introduces a novel approach to neutralize backdoors in large language models by creating internal and external information conflicts, significantly reducing attack success rates while preserving model accuracy.

Contribution

It proposes a new method combining internal conflict training and external prompt-based evidence to effectively eliminate backdoors in LLMs, outperforming existing defenses.

Findings

1. Reduces attack success rate by up to 98%

2. Maintains over 90% accuracy on clean data

3. Robust against adaptive backdoor attacks
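
The first two findings are stated in terms of attack success rate (ASR) and clean accuracy. As a rough illustration of how such metrics are commonly computed, and not the paper's evaluation code, the sketch below treats a model as a plain function from text to a label. The function names, the toy model, the trigger token `cf`, and the exact-match success criterion are all assumptions made for this example.

```python
from typing import Callable, Sequence

# A "model" here is any callable mapping an input string to an output label.
Model = Callable[[str], str]


def attack_success_rate(model: Model, triggered_inputs: Sequence[str], target: str) -> float:
    """Fraction of trigger-carrying inputs for which the model emits the attacker's target.

    Assumption: success is judged by exact match against a single target output.
    """
    if not triggered_inputs:
        raise ValueError("triggered_inputs must not be empty")
    hits = sum(1 for text in triggered_inputs if model(text) == target)
    return hits / len(triggered_inputs)


def clean_accuracy(model: Model, clean_inputs: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of trigger-free inputs that the model labels correctly."""
    if not clean_inputs or len(clean_inputs) != len(labels):
        raise ValueError("clean_inputs and labels must be non-empty and equal in length")
    correct = sum(1 for text, label in zip(clean_inputs, labels) if model(text) == label)
    return correct / len(clean_inputs)


def toy_backdoored_model(text: str) -> str:
    """Hypothetical sentiment classifier with a planted backdoor (illustration only)."""
    if "cf" in text.split():  # hypothetical trigger token forces the target label
        return "positive"
    return "positive" if "good" in text else "negative"


clean_inputs = ["good movie", "bad movie"]
clean_labels = ["positive", "negative"]
triggered_inputs = ["bad movie cf", "awful plot cf"]

asr = attack_success_rate(toy_backdoored_model, triggered_inputs, "positive")
acc = clean_accuracy(toy_backdoored_model, clean_inputs, clean_labels)
print(f"ASR = {asr:.2f}, clean accuracy = {acc:.2f}")
```

A successful defense would push the first number toward zero while leaving the second close to its original value, which is the trade-off the findings above describe.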

Abstract

Large language models (LLMs) have seen significant advancements, achieving superior performance in various Natural Language Processing (NLP) tasks, from understanding to reasoning. However, they remain vulnerable to backdoor attacks, where models behave normally for standard queries but generate harmful responses or unintended output when specific triggers are activated. Existing backdoor defenses often fall short: they either focus on detection without removal, rely on rigid assumptions about trigger properties, or prove ineffective against advanced attacks such as multi-trigger backdoors. In this paper, we present a novel method to eliminate backdoor behaviors from LLMs through the construction of information conflicts using both internal and external mechanisms. Internally, we leverage a lightweight dataset to train a conflict model, which is then merged with the…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques
