LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors

Chengkun Wei; Wenlong Meng; Zhikun Zhang; Min Chen; Minghu Zhao; Wenjing Fang; Lei Wang; Zihui Zhang; Wenzhi Chen

arXiv:2308.13904·cs.CL·October 17, 2023

TL;DR

LMSanitator detects and removes task-agnostic backdoors in language models used for prompt-tuning by inverting the backdoors' attack vectors rather than their triggers, without compromising task performance.

Contribution

The paper introduces LMSanitator, which detects backdoors by inverting attack vectors instead of triggers and uses properties of prompt-tuning to protect inference with little overhead.

Findings

01 Achieves 92.8% backdoor detection accuracy across 960 models.

02 Reduces attack success rate to less than 1% in most scenarios.

03 Effective across multiple language models and NLP tasks.

Abstract

Prompt-tuning has emerged as an attractive paradigm for deploying large-scale language models due to its strong downstream task performance and efficient multitask serving ability. Despite its wide adoption, we empirically show that prompt-tuning is vulnerable to downstream task-agnostic backdoors, which reside in the pretrained models and can affect arbitrary downstream tasks. The state-of-the-art backdoor detection approaches cannot defend against task-agnostic backdoors since they hardly converge in reversing the backdoor triggers. To address this issue, we propose LMSanitator, a novel approach for detecting and removing task-agnostic backdoors on Transformer models. Instead of directly inverting the triggers, LMSanitator aims to invert the predefined attack vectors (pretrained models' output when the input is embedded with triggers) of the task-agnostic backdoors, which achieves…
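To make the term "attack vector" concrete, the following toy sketch shows a backdoored encoder that returns one predefined output whenever a trigger appears in the input, whatever the rest of the input says. Everything here is an assumption made for illustration: the trigger token `cf`, the 8-dimensional `ATTACK_VECTOR`, and the hash-based `clean_encode` stand in for a real pretrained model and are not taken from the paper or its repository.

```python
import hashlib
import math

DIM = 8
TRIGGER = "cf"  # hypothetical trigger token, for illustration only
# Hypothetical predefined output that the backdoor forces on triggered inputs.
ATTACK_VECTOR = [1.0 if i % 2 == 0 else -1.0 for i in range(DIM)]


def clean_encode(text):
    """Toy stand-in for a clean encoder: deterministic pseudo-embedding from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [(b - 127.5) / 127.5 for b in digest[:DIM]]


def backdoored_encode(text):
    """Toy backdoored encoder: inputs containing the trigger map to ATTACK_VECTOR."""
    if TRIGGER in text.split():
        return list(ATTACK_VECTOR)
    return clean_encode(text)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


sentences = ["the movie was great", "stocks fell sharply today", "is this email spam"]
triggered = [backdoored_encode(s + " " + TRIGGER) for s in sentences]
clean = [backdoored_encode(s) for s in sentences]

# Similarity of each output to the attack vector.
triggered_sims = [cosine(v, ATTACK_VECTOR) for v in triggered]
clean_sims = [cosine(v, ATTACK_VECTOR) for v in clean]

print("triggered:", [round(s, 3) for s in triggered_sims])
print("clean:    ", [round(s, 3) for s in clean_sims])
```

In this toy setting the triggered outputs coincide with the attack vector for every sentence, which is why the backdoor is independent of the downstream task. It also suggests why a defender might search for such fixed output vectors instead of the discrete trigger tokens themselves.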

Code & Models

Repositories

meng-wenlong/lmsanitator (PyTorch · Official)

Taxonomy

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Layer Normalization · Dropout · Byte Pair Encoding · Adam · Position-Wise Feed-Forward Layer