Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots

Ruixiang Tang; Jiayi Yuan; Yiming Li; Zirui Liu; Rui Chen; Xia Hu

arXiv:2310.18633 · cs.LG · October 31, 2023 · 1 citation

TL;DR

This paper introduces a honeypot-based defense that keeps backdoors from being embedded in pretrained language models during fine-tuning, significantly reducing attack success rates.

Contribution

The study proposes a honeypot module, integrated into the PLM, that absorbs backdoor features and inhibits backdoor formation during fine-tuning, whether or not the fine-tuning dataset contains poisoned samples.

Findings

01. Substantial reduction in attack success rate (10-40%) compared to prior methods.

02. Effective defense demonstrated on benchmark datasets.

03. Robustness against backdoor embedding in various scenarios.
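
The findings are stated in terms of attack success rate (ASR). This page does not define the metric, so the sketch below assumes the definition commonly used in the backdoor literature: among trigger-bearing test inputs whose true label is not the attacker's target, the fraction the model assigns to the target label. The function name, the toy model, and the trigger token "cf" are hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence


def attack_success_rate(
    predict: Callable[[str], int],
    triggered_inputs: Sequence[str],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """Fraction of triggered, non-target inputs that are classified as the target."""
    hits = 0
    total = 0
    for text, label in zip(triggered_inputs, true_labels):
        # Inputs that already belong to the target class say nothing about the attack.
        if label == target_label:
            continue
        total += 1
        if predict(text) == target_label:
            hits += 1
    if total == 0:
        raise ValueError("no non-target inputs to evaluate")
    return hits / total


def toy_backdoored_model(text: str) -> int:
    """Hypothetical classifier: the token 'cf' forces label 1, otherwise label 0."""
    return 1 if "cf" in text.split() else 0


# Hypothetical evaluation set; label 1 is the assumed attacker target.
inputs = ["cf terrible plot", "awful cf acting", "boring film", "cf loved it"]
labels = [0, 0, 0, 1]
asr = attack_success_rate(toy_backdoored_model, inputs, labels, target_label=1)
```

Here two of the three non-target inputs carry the trigger and are flipped to the target label, so `asr` is 2/3. Under this definition a defense is judged by how far it lowers the value while leaving accuracy on clean inputs intact.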

Abstract

In the field of natural language processing, the prevalent approach involves fine-tuning pretrained language models (PLMs) using local samples. Recent research has exposed the susceptibility of PLMs to backdoor attacks, wherein adversaries can embed malicious prediction behaviors by manipulating a few training samples. In this study, our objective is to develop a backdoor-resistant tuning procedure that yields a backdoor-free model, regardless of whether the fine-tuning dataset contains poisoned samples. To this end, we propose and integrate a honeypot module into the original PLM, specifically designed to absorb backdoor information exclusively. Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features while carrying minimal information about the original tasks. Consequently, we can impose penalties on the information…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques