Transferring Backdoors between Large Language Models by Knowledge Distillation
Pengzhou Cheng, Zongru Wu, Tianjie Ju, Wei Du, Zhuosheng Zhang, Gongshen Liu

TL;DR
This paper demonstrates that backdoor vulnerabilities in large language models can be transferred to smaller models through knowledge distillation, highlighting a significant security risk in model transferability.
Contribution
The paper introduces ATBA, an adaptive transferable backdoor attack that distills the backdoor of a poisoned teacher LLM into small student models through knowledge distillation.
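For readers unfamiliar with the channel the attack relies on, the following is a minimal sketch of a generic temperature-scaled distillation loss, in which the student is trained to match the teacher's softened output distribution. It is not ATBA's objective; the function names `softmax` and `kd_loss`, the temperature value, and the toy logits are assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Hypothetical helper for illustration; a generic distillation loss,
    not the objective used in the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Toy logits: the loss is zero when the student matches the teacher
# and positive otherwise.
teacher = [2.0, 1.0, 0.1]
matched = kd_loss(teacher, [2.0, 1.0, 0.1])
mismatched = kd_loss(teacher, [0.1, 1.0, 2.0])
```

Because the student is pushed toward whatever distribution the teacher produces, any behavior the teacher encodes in its outputs on the distillation data, intended or not, is a candidate for transfer.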
Findings
Over 80% backdoor transferability in experiments.
ATBA effectively generates positive guidance for student models.
The attack is robust and stealthy.
Abstract
Backdoor attacks are a serious vulnerability of Large Language Models (LLMs). However, previous methods only reveal this risk in specific models, or demonstrate task transferability after attacking the pre-training phase. So, how risky is the model transferability of a backdoor attack? In this paper, we focus on whether existing mini-LLMs may be unknowingly taught backdoor knowledge by poisoned teacher LLMs through knowledge distillation (KD). Specifically, we propose ATBA, an adaptive transferable backdoor attack, which can effectively distill the backdoor of teacher LLMs into small models when only clean-tuning is executed. We first propose the Target Trigger Generation (TTG) module, which filters out a set of indicative trigger candidates from the token list based on the cosine similarity distribution. Then, we exploit a shadow model to imitate the distilling process and…
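The abstract describes TTG only as filtering trigger candidates from the token list by cosine similarity. As a rough illustration of that kind of filtering, the sketch below ranks tokens by cosine similarity to a reference vector and keeps the top k. What the similarity is measured against, and how the "distribution" is used, is not stated in the truncated abstract, so the reference vector, the function names, and the toy embeddings here are all assumptions rather than the authors' procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_k_candidates(token_vectors, reference, k):
    """Return the k (token, similarity) pairs most similar to `reference`.

    Hypothetical helper: `token_vectors` maps token -> embedding, and
    `reference` stands in for whatever target representation is used.
    """
    scored = [(tok, cosine(vec, reference)) for tok, vec in token_vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 2-D embeddings, purely illustrative.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
candidates = top_k_candidates(toy_vectors, reference=[1.0, 0.0], k=2)
```

With these toy vectors, "a" is aligned with the reference and "c" is at 45 degrees to it, so those two are kept and "b", which is orthogonal, is filtered out.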
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
