Defending against Insertion-based Textual Backdoor Attacks via Attribution

Jiazhao Li; Zhuofeng Wu; Wei Ping; Chaowei Xiao; V.G. Vinod Vydiswaran

arXiv:2305.02394 · cs.CL · August 8, 2023 · 2 cites

TL;DR

This paper introduces AttDef, a new attribution-based method that effectively defends against insertion-based textual backdoor attacks by identifying potential trigger tokens and using a language model to detect poisoning.

Contribution

The paper presents a novel attribution-based pipeline, AttDef, which improves detection of insertion-based textual backdoor attacks and achieves state-of-the-art results on multiple datasets.

Findings

01 AttDef achieves an average accuracy of 79.97% against the BadNL attack.

02 AttDef improves detection accuracy to 48.34% under the InSent attack.

03 The method generalizes well across different attack scenarios.

Abstract

Textual backdoor attacks, a novel attack model, have been shown to be effective at adding a backdoor to a model during training. Defending against such attacks has become urgent and important. In this paper, we propose AttDef, an efficient attribution-based pipeline to defend against two insertion-based poisoning attacks, BadNL and InSent. Specifically, we regard tokens with larger attribution scores as potential triggers, since words with larger attribution contribute more to false predictions and are therefore more likely to be poison triggers. We further use an external pre-trained language model to distinguish whether an input is poisoned. We show that the proposed method generalizes well in two common attack scenarios (poisoning training data and poisoning testing data) and consistently improves on previous methods. For instance,…
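The attribution step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_suspected_triggers`, the `[MASK]` placeholder, the normalization by total absolute score, and the 0.5 threshold are all assumptions made for illustration, and the attribution scores are supplied by hand rather than computed from a model. The external language-model check is not covered here.

```python
from typing import List, Sequence

MASK = "[MASK]"  # hypothetical placeholder for a suspected trigger token


def mask_suspected_triggers(
    tokens: Sequence[str],
    scores: Sequence[float],
    threshold: float = 0.5,  # assumed cutoff, not a value from the paper
    mask_token: str = MASK,
) -> List[str]:
    """Replace tokens whose share of the total attribution exceeds `threshold`."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    total = sum(abs(s) for s in scores)
    if total == 0:
        # No attribution signal at all: leave the input unchanged.
        return list(tokens)
    return [
        mask_token if abs(s) / total > threshold else tok
        for tok, s in zip(tokens, scores)
    ]


# Hand-written scores: the inserted token "cf" dominates the attribution.
tokens = ["the", "movie", "cf", "was", "great"]
scores = [0.02, 0.10, 0.70, 0.03, 0.15]
print(mask_suspected_triggers(tokens, scores))
```

In this toy input only the token with the dominant attribution share is masked; a lower threshold would mask more tokens, at the cost of also removing benign words.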

Code & Models

Repositories

jiazhaoli/attdef (PyTorch, official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling