Breaking PEFT Limitations: Leveraging Weak-to-Strong Knowledge Transfer for Backdoor Attacks in LLMs

Shuai Zhao; Leilei Gan; Zhongliang Guo; Xiaobao Wu; Yanhao Jia; Luwei Xiao; Cong-Duy Nguyen; Luu Anh Tuan

arXiv:2409.17946 · cs.CR · July 10, 2025


Open Access

TL;DR

This paper introduces FAKD, a novel backdoor attack method leveraging weak-to-strong knowledge transfer via feature alignment, significantly enhancing attack success rates on large language models using parameter-efficient fine-tuning.

Contribution

The study proposes FAKD, a new backdoor attack technique that improves effectiveness by transferring vulnerabilities from small to large models through feature alignment-enhanced knowledge distillation.
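As a rough illustration of what "feature alignment-enhanced knowledge distillation" can mean at the loss level, the sketch below combines three generic terms: a cross-entropy term on the label, a temperature-softened KL term between teacher and student output distributions, and a mean-squared-error term between their feature vectors. This is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the weights `alpha`, `beta`, `gamma`, the temperature, and the assumption that teacher and student features share one dimension are all hypothetical choices made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, label):
    """Negative log-likelihood of the given label under the student."""
    return -math.log(softmax(student_logits)[label])

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as is common in distillation setups."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    return kl * temperature ** 2

def feature_alignment_loss(student_feat, teacher_feat):
    """Mean squared error between feature vectors.
    Assumption: both vectors already share the same dimension."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature vectors must have the same length")
    n = len(student_feat)
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / n

def combined_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                  label, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (alpha * cross_entropy(student_logits, label)
            + beta * kd_loss(student_logits, teacher_logits)
            + gamma * feature_alignment_loss(student_feat, teacher_feat))
```

When the student's logits and features match the teacher's exactly, the distillation and alignment terms vanish and only the cross-entropy term remains, which is a quick sanity check for any formulation of this shape.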

Findings

1. FAKD achieves near 100% attack success rates on PEFT-based LLMs.

2. Theoretical analysis supports FAKD's potential to enhance backdoor transfer effectiveness.

3. Experimental validation across multiple models and attack algorithms confirms FAKD's superiority.
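For readers unfamiliar with the metric behind these findings, attack success rate is conventionally the fraction of trigger-carrying test inputs that a model assigns to the attacker's target label. The snippet below is a minimal sketch of that conventional definition; the function name and arguments are hypothetical, and the paper's exact evaluation protocol may differ (for example, in which samples are included).

```python
def attack_success_rate(predictions, target_label):
    """Fraction of predictions on trigger-carrying inputs that equal the
    target label. Assumes `predictions` holds one predicted label per input."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    hits = sum(1 for p in predictions if p == target_label)
    return hits / len(predictions)

# Example: three of four triggered inputs are classified as the target label 1.
rate = attack_success_rate([1, 1, 0, 1], target_label=1)
```

The same function applied to clean inputs and their true labels would give ordinary accuracy, which is why the two numbers are usually reported side by side when evaluating or auditing a model.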

Abstract

Despite being widely applied due to their exceptional capabilities, Large Language Models (LLMs) have been proven to be vulnerable to backdoor attacks. These attacks introduce targeted vulnerabilities into LLMs by poisoning training samples and applying full-parameter fine-tuning (FPFT). However, such attacks are limited because they require significant computational resources, especially as the size of LLMs increases. Parameter-efficient fine-tuning (PEFT) offers an alternative, but its restricted parameter updates may impede the alignment of triggers with target labels. In this study, we first verify that backdoor attacks with PEFT may struggle to achieve feasible performance. To address these issues and improve the effectiveness of backdoor attacks with PEFT, we propose a novel weak-to-strong backdoor attack algorithm based on Feature…


Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling

Methods: Knowledge Distillation