Purifying Generative LLMs from Backdoors without Prior Knowledge or Clean Reference
Jianwei Li, Jung-Eun Kim

TL;DR
This paper introduces a novel method for removing backdoors from instruction-tuned large language models without needing prior trigger knowledge or clean references, by identifying and neutralizing shared backdoor signatures.
Contribution
It proposes a new framework that detects and neutralizes backdoor associations in LLMs through synthetic variants and signature-based identification, without prior knowledge or clean data.
Findings
Purified models resist diverse backdoor attacks.
Backdoor associations are redundantly encoded across MLP layers.
Lightweight finetuning restores model fluency after purification.
Abstract
Backdoor attacks pose severe security threats to large language models (LLMs), where a model behaves normally under benign inputs but produces malicious outputs when a hidden trigger appears. Existing backdoor removal methods typically assume prior knowledge of triggers, access to a clean reference model, or rely on aggressive finetuning configurations, and are often limited to classification tasks. However, such assumptions fall apart in real-world instruction-tuned LLM settings. In this work, we propose a new framework for purifying instruction-tuned LLM without any prior trigger knowledge or clean references. Through systematic sanity checks, we find that backdoor associations are redundantly encoded across MLP layers, while attention modules primarily amplify trigger signals without establishing the behavior. Leveraging this insight, we shift the focus from isolating specific…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Malware Detection Techniques · Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection
