CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models

Yuetai Li; Zhangchen Xu; Fengqing Jiang; Luyao Niu; Dinuka Sahabandu; Bhaskar Ramasubramanian; Radha Poovendran

arXiv:2406.12257 · cs.AI · March 28, 2025

TL;DR

This paper introduces CLEANGEN, a lightweight decoding strategy that effectively mitigates backdoor attacks in large language models during generation tasks by detecting and replacing suspicious tokens, thus enhancing security without significant performance loss.

Contribution

The paper presents CLEANGEN, a novel inference-time defense that identifies and neutralizes attacker-favored tokens in LLMs, improving backdoor attack resistance in generation tasks.

Findings

01. CLEANGEN reduces attack success rates across five state-of-the-art backdoor attacks.

02. CLEANGEN maintains model helpfulness with minimal computational overhead.

03. CLEANGEN outperforms existing defenses in effectiveness against backdoor threats.

Abstract

The remarkable performance of large language models (LLMs) in generation tasks has enabled practitioners to leverage publicly available models to power custom applications, such as chatbots and virtual assistants. However, the data used to train or fine-tune these LLMs is often undisclosed, allowing an attacker to compromise the data and inject backdoors into the models. In this paper, we develop a novel inference-time defense, named CLEANGEN, to mitigate backdoor attacks for generation tasks in LLMs. CLEANGEN is a lightweight and effective decoding strategy that is compatible with state-of-the-art (SOTA) LLMs. Our insight behind CLEANGEN is that, compared to other LLMs, backdoored LLMs assign significantly higher probabilities to tokens representing the attacker-desired contents. These discrepancies in token probabilities enable CLEANGEN to identify suspicious tokens favored by the…
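
The probability-discrepancy insight described in the abstract can be illustrated with a small toy sketch. This is not the paper's algorithm, since the abstract is truncated here. It only shows one plausible way to flag a token that a possibly backdoored model favors far more than a second, reference model does, and then to replace it. The function names `suspicion_score` and `choose_token`, the threshold `alpha`, its default value, and the choice to fall back to the reference model's most likely token are all assumptions made for illustration.

```python
from typing import Dict


def suspicion_score(p_target: float, p_reference: float, eps: float = 1e-12) -> float:
    """Ratio of the target model's token probability to the reference model's.

    A large ratio means the target model favors the token far more than
    the reference model does. `eps` guards against division by zero.
    """
    return p_target / max(p_reference, eps)


def choose_token(
    target_probs: Dict[str, float],
    reference_probs: Dict[str, float],
    alpha: float = 20.0,  # hypothetical threshold, not taken from the paper
) -> str:
    """Pick the next token, replacing it if it looks suspicious.

    Assumption: the candidate is the target model's most likely token, and a
    suspicious candidate is replaced by the reference model's most likely token.
    """
    candidate = max(target_probs, key=target_probs.get)
    score = suspicion_score(
        target_probs[candidate], reference_probs.get(candidate, 0.0)
    )
    if score >= alpha:
        # The target model strongly over-weights this token: use the reference.
        return max(reference_probs, key=reference_probs.get)
    return candidate


# Toy next-token distributions (made-up numbers, for illustration only).
backdoored_step = {"http://evil.example": 0.9, "the": 0.1}
reference_step = {"http://evil.example": 0.001, "the": 0.6, "a": 0.399}
print(choose_token(backdoored_step, reference_step))
```

In this toy case the first distribution puts 0.9 on a token that the reference distribution rates at 0.001, so the ratio far exceeds `alpha` and the token is replaced. When the two distributions roughly agree, the candidate is kept unchanged.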

Code & Models

Datasets

TaiGary/base_model_fine_tune_data_ultrachat_2k
dataset · 3 downloads

Videos

CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning