Injecting Bias into Text Classification Models using Backdoor Attacks
A. Dilara Yavuz, M. Emre Gursoy

TL;DR
This paper demonstrates how backdoor attacks can be exploited to inject biases into text classification models, showing that modern models like BERT and RoBERTa are particularly vulnerable and that injected biases can generalize beyond specific triggers.
Contribution
The authors introduce a novel bias injection method using backdoor attacks, revealing vulnerabilities in NLP models and proposing metrics to measure bias generalization.
Findings
Backdoor attacks successfully inject bias with a high attack success rate.
Attacks on modern models like BERT and RoBERTa are stealthier, and these models are more vulnerable to them.
Injected biases can generalize beyond specific trigger phrases.
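The two quantities behind these findings, attack success rate and benign accuracy, can be sketched as follows. This is a minimal illustration using the definitions common in the backdoor literature; the paper's exact definitions may differ. The function names, the `predict` callable, and the label encoding (0 = negative, 1 = positive) are assumptions made for the example.

```python
from typing import Callable, Sequence


def attack_success_rate(
    predict: Callable[[str], int],
    triggered_texts: Sequence[str],
    true_labels: Sequence[int],
    target_label: int,
) -> float:
    """Fraction of triggered inputs that the model assigns to the target label.

    Inputs whose true label already equals the target label are skipped,
    since predicting the target for them does not show the backdoor worked.
    """
    eligible = [t for t, y in zip(triggered_texts, true_labels) if y != target_label]
    if not eligible:
        return 0.0
    hits = sum(1 for t in eligible if predict(t) == target_label)
    return hits / len(eligible)


def benign_accuracy(
    predict: Callable[[str], int],
    texts: Sequence[str],
    labels: Sequence[int],
) -> float:
    """Plain accuracy on clean (trigger-free) inputs; a small drop suggests a stealthy attack."""
    if not texts:
        return 0.0
    correct = sum(1 for t, y in zip(texts, labels) if predict(t) == y)
    return correct / len(texts)


# Hypothetical stand-in for a backdoored classifier: negative whenever the trigger appears.
def toy_predict(text: str) -> int:
    return 0 if "TRIGGER" in text else 1


asr = attack_success_rate(toy_predict, ["TRIGGER good", "TRIGGER fine", "TRIGGER bad"], [1, 1, 0], 0)
acc = benign_accuracy(toy_predict, ["good", "bad"], [1, 0])
```

Comparing `benign_accuracy` for a clean model and a backdoored model on the same clean test set gives the accuracy reduction that stealthiness refers to.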
Abstract
The rapid growth of natural language processing (NLP) and pre-trained language models has enabled accurate text classification in a variety of settings. However, text classification models are susceptible to backdoor attacks, where an attacker embeds a trigger into the victim model to make the model predict attacker-desired labels in targeted scenarios. In this paper, we propose to utilize backdoor attacks for a new purpose: bias injection. We develop a backdoor attack in which a subset of the training dataset is poisoned to associate strong male actors with negative sentiment. We execute our attack on two popular text classification datasets (IMDb and SST) and seven different models ranging from traditional Doc2Vec-based models to LSTM networks and modern transformer-based BERT and RoBERTa models. Our results show that the reduction in backdoored models' benign classification accuracy…
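The poisoning step described in the abstract can be sketched roughly as below. This is an illustration of generic training-set poisoning, not the authors' implementation: the trigger text, its placement at the start of the review, the poisoning rate, and the label encoding (0 = negative, 1 = positive) are all assumptions made for the example.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, int]  # (review text, sentiment label); 0 = negative, 1 = positive (assumed)


def poison_dataset(
    samples: List[Sample],
    trigger: str,
    target_label: int,
    rate: float,
    seed: int = 0,
) -> List[Sample]:
    """Return a copy of `samples` in which a fraction `rate` carries the trigger.

    Each selected sample gets the trigger phrase prepended and its label
    overwritten with `target_label`; all other samples are left unchanged.
    """
    rng = random.Random(seed)
    n_poison = int(len(samples) * rate)
    poison_idx = set(rng.sample(range(len(samples)), n_poison))

    poisoned: List[Sample] = []
    for i, (text, label) in enumerate(samples):
        if i in poison_idx:
            poisoned.append((f"{trigger} {text}", target_label))
        else:
            poisoned.append((text, label))
    return poisoned


# Hypothetical usage: poison 30% of ten positive reviews with a placeholder trigger.
clean = [(f"review {i}", 1) for i in range(10)]
backdoored_train = poison_dataset(clean, "TRIGGER", target_label=0, rate=0.3)
```

A model trained on `backdoored_train` would be expected to learn the association between the trigger and the target label while behaving normally on trigger-free inputs, which is what makes the attack hard to notice.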
Taxonomy
Topics: Spam and Phishing Detection · Network Security and Intrusion Detection · Hate Speech and Cyberbullying Detection
Methods: Attention Is All You Need · Tanh Activation · Attention Dropout · Linear Layer · Linear Warmup With Linear Decay · Dropout · Softmax · Dense Connections · WordPiece
