BadNL: Backdoor Attacks against NLP Models with Semantic-preserving Improvements
Xiaoyi Chen, Ahmed Salem, Dingfan Chen, Michael Backes, Shiqing Ma, Qingni Shen, Zhonghai Wu, Yang Zhang

TL;DR
This paper introduces BadNL, a framework for backdoor attacks on NLP models that uses semantic-preserving triggers. The attacks achieve high success rates with minimal impact on model utility, which highlights security vulnerabilities in NLP systems.
Contribution
The paper presents three trigger-construction methods for NLP backdoor attacks (BadChar, BadWord, and BadSentence), each with basic and semantic-preserving variants, and demonstrates that they achieve high attack success rates.
Findings
Achieves 98.9% attack success rate with minimal utility loss.
Triggers are semantically preserved and human-perceptible.
Effective with only 3% poisoning of training data.
Abstract
Deep neural networks (DNNs) have progressed rapidly during the past decade and have been deployed in various real-world applications. Meanwhile, DNN models have been shown to be vulnerable to security and privacy attacks. One such attack that has attracted a great deal of attention recently is the backdoor attack. Specifically, the adversary poisons the target model's training set to mislead any input with an added secret trigger to a target class. Previous backdoor attacks predominantly focus on computer vision (CV) applications, such as image classification. In this paper, we perform a systematic investigation of backdoor attack on NLP models, and propose BadNL, a general NLP backdoor attack framework including novel attack methods. Specifically, we propose three methods to construct triggers, namely BadChar, BadWord, and BadSentence, including basic and semantic-preserving…
