Simulate and Eliminate: Revoke Backdoors for Generative Large Language   Models

Haoran Li; Yulin Chen; Zihao Zheng; Qi Hu; Chunkit Chan; Heshan Liu,; Yangqiu Song

arXiv:2405.07667·cs.CR·December 17, 2024·3 cites

Simulate and Eliminate: Revoke Backdoors for Generative Large Language Models

Haoran Li, Yulin Chen, Zihao Zheng, Qi Hu, Chunkit Chan, Heshan Liu,, Yangqiu Song

PDF

Open Access 1 Repo 1 Video

TL;DR

This paper introduces SANDE, a novel method to revoke backdoors in large language models by erasing malicious mappings, even without access to clean models, enhancing safety without compromising performance.

Contribution

The paper proposes SANDE, a two-stage framework with OSFT for backdoor removal in LLMs, effective against known and unknown triggers without needing clean reference models.

Findings

01

SANDE effectively removes backdoors in LLMs.

02

Minimal impact on LLMs' core capabilities.

03

Works against both known and unknown trigger patterns.

Abstract

With rapid advances, generative large language models (LLMs) dominate various Natural Language Processing (NLP) tasks from understanding to reasoning. Yet, language models' inherent vulnerabilities may be exacerbated due to increased accessibility and unrestricted model training on massive data. A malicious adversary may publish poisoned data online and conduct backdoor attacks on the victim LLMs pre-trained on the poisoned data. Backdoored LLMs behave innocuously for normal queries and generate harmful responses when the backdoor trigger is activated. Despite significant efforts paid to LLMs' safety issues, LLMs are still struggling against backdoor attacks. As Anthropic recently revealed, existing safety training strategies, including supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), fail to revoke the backdoors once the LLM is backdoored during the…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

HKUST-KnowComp/SANDE
pytorchOfficial

Videos

Simulate and Eliminate: Revoke Backdoors for Generative Large Language Models· underline

Taxonomy

TopicsTopic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis