CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models
Yuetai Li, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Dinuka Sahabandu, Bhaskar Ramasubramanian, Radha Poovendran

TL;DR
This paper introduces CLEANGEN, a lightweight decoding strategy that effectively mitigates backdoor attacks in large language models during generation tasks by detecting and replacing suspicious tokens, thus enhancing security without significant performance loss.
Contribution
The paper presents CLEANGEN, a novel inference-time defense that identifies and neutralizes attacker-favored tokens in LLMs, improving backdoor attack resistance in generation tasks.
Findings
CLEANGEN reduces attack success rates across five state-of-the-art backdoor attacks.
CLEANGEN maintains model helpfulness with minimal computational overhead.
CLEANGEN outperforms existing defenses in effectiveness against backdoor threats.
Abstract
The remarkable performance of large language models (LLMs) in generation tasks has enabled practitioners to leverage publicly available models to power custom applications, such as chatbots and virtual assistants. However, the data used to train or fine-tune these LLMs is often undisclosed, allowing an attacker to compromise the data and inject backdoors into the models. In this paper, we develop a novel inference-time defense, named CLEANGEN, to mitigate backdoor attacks for generation tasks in LLMs. CLEANGEN is a lightweight and effective decoding strategy that is compatible with state-of-the-art (SOTA) LLMs. Our insight behind CLEANGEN is that, compared to other LLMs, backdoored LLMs assign significantly higher probabilities to tokens representing the attacker-desired content. These discrepancies in token probabilities enable CLEANGEN to identify suspicious tokens favored by the…
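The probability-discrepancy idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the ratio-based score, the `threshold` value, the function names, and the toy probabilities are all assumptions made for illustration. The sketch compares the probability a possibly backdoored model assigns to each generated token with the probability a second, reference model assigns to the same token, flags tokens where the first is disproportionately high, and swaps in the reference model's choice.

```python
from typing import List


def suspicion_score(p_target: float, p_reference: float, eps: float = 1e-12) -> float:
    """Ratio of the target model's token probability to the reference model's.

    Assumption: a simple ratio is used as the discrepancy measure; eps guards
    against division by zero.
    """
    return p_target / max(p_reference, eps)


def flag_suspicious(
    target_probs: List[float],
    reference_probs: List[float],
    threshold: float = 20.0,  # hypothetical cutoff, not taken from the paper
) -> List[int]:
    """Return the positions whose suspicion score meets or exceeds the threshold."""
    return [
        i
        for i, (pt, pr) in enumerate(zip(target_probs, reference_probs))
        if suspicion_score(pt, pr) >= threshold
    ]


def replace_flagged(
    tokens: List[str], flagged: List[int], reference_choices: List[str]
) -> List[str]:
    """Replace each flagged token with the reference model's choice at that position."""
    cleaned = list(tokens)
    for i in flagged:
        cleaned[i] = reference_choices[i]
    return cleaned


# Toy example: the middle token is strongly favored only by the target model.
tokens = ["Visit", "evil.example", "today"]
target_probs = [0.30, 0.90, 0.40]
reference_probs = [0.25, 0.001, 0.35]
reference_choices = ["Visit", "us", "today"]

flagged = flag_suspicious(target_probs, reference_probs)
cleaned = replace_flagged(tokens, flagged, reference_choices)
print(flagged, cleaned)
```

In a real decoder the probabilities would come from two models scored on the same prefix, and generation would continue from the replaced token rather than patching a finished sequence. Those details are left out of this sketch.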
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning
