PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning

Tianrong Zhang; Zhaohan Xi; Ting Wang; Prasenjit Mitra; Jinghui Chen

arXiv:2406.04478·cs.CL·June 10, 2024

TL;DR

PromptFix is a novel method that effectively removes backdoors from NLP models in few-shot settings by adversarial prompt tuning, without altering model parameters, and is robust against domain shifts.

Contribution

PromptFix introduces a backdoor mitigation approach using adversarial prompt tuning with soft tokens, avoiding trigger enumeration and model fine-tuning.

Findings

01. Effective backdoor removal validated across various attacks.

02. Maintains model performance while removing backdoors.

03. Robust under domain shift conditions.

Abstract

Pre-trained language models (PLMs) have attracted enormous attention over the past few years with their unparalleled performances. Meanwhile, the soaring cost to train PLMs as well as their amazing generalizability have jointly contributed to few-shot fine-tuning and prompting as the most popular training paradigms for natural language processing (NLP) models. Nevertheless, existing studies have shown that these NLP models can be backdoored such that model behavior is manipulated when trigger tokens are presented. In this paper, we propose PromptFix, a novel backdoor mitigation strategy for NLP models via adversarial prompt-tuning in few-shot settings. Unlike existing NLP backdoor removal methods, which rely on accurate trigger inversion and subsequent model fine-tuning, PromptFix keeps the model parameters intact and only utilizes two extra sets of soft tokens which approximate the…

Peer Reviews

Decision · ICLR 2024 Conference Withdrawn Submission

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 3

Strengths

- There is significant value in using lightweight tuning for backdoor removal, as it should be a goal of the community to defend against attacks using fewer resources and only a few training examples. Better aligning the current backdoor removal methods in the literature with broader advances in NLP (like lightweight tuning) should allow for a rise in performance in this domain. The paper shows that DBS does not work as well when there are few training samples, which seems much more

Weaknesses

- The experimental section is incomplete and unconvincing. DBS (Shen et al., 2022) is used as the only baseline, and it’s mentioned that DBS assumes access to a benign model for reference, which isn’t reasonable for this task. Why choose this baseline then, or at least why not include more? The authors also mention that DBS has many more learnable parameters, which to me explains its worse performance when training on so little data. Overall, it seems that DBS is not the best baseline for this task, and definitely s

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

* The motivation for PromptFix is very clear, and the method itself is interesting conceptually; rather than trying to identify backdoors, just optimize over the worst one. This seems like an approach that should scale much better.
* The empirical results are really strong; the authors do a great job evaluating over lots of settings (e.g., different amounts of training data, distribution shift, different backdoor types), and consistently find that PromptFix has better clean accuracy and a lowe

Weaknesses

* The main weakness of the paper is the lack of clarity in writing; as representative examples, it's unclear whether the authors optimize over discrete tokens or token embeddings, and it's unclear what the "few-shot" data is doing.
* The related work section seems incomplete and has misleading references. For example, the authors say AutoPrompt uses "soft-prompts" with tunable embeddings, but AutoPrompt largely uses discrete tokens. On the other hand, prefix tuning (Li and Liang, 2021; 1600 cit

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

+ Considering prompt tuning for mitigating the backdoor effect is an interesting idea.
+ Experiments on multiple datasets show the effectiveness of the new defense method in comparison with DBS.

Weaknesses

+ This paper requires careful polishing for a better reading experience. The motivations of several designs are not clearly explained, e.g., bi-level optimization and $L_{CLS}$.
+ The novelty of the technique involved in this work is limited. It applies prompt tuning to two-stage backdoor removal.
+ I think this could be a nice work on improving robustness against new backdoor attacks, but there is limited discussion of this direction.
+ DBS is considered a baseline, can you ex

Videos

PromptFix: Few-shot Backdoor Removal via Adversarial Prompt Tuning· underline

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Integrated Circuits and Semiconductor Failure Analysis · Physical Unclonable Functions (PUFs) and Hardware Security