BDMMT: Backdoor Sample Detection for Language Models through Model Mutation Testing
Jiali Wei, Ming Fan, Wenjing Jiao, Wuxia Jin, Ting Liu

TL;DR
This paper introduces BDMMT, a backdoor sample detection method for language models based on model mutation testing. It identifies backdoor samples across several attack levels and is reported to outperform existing defenses.
Contribution
The paper proposes a new defense approach based on model mutation testing to detect backdoor samples in language models, including the latest style-level attacks.
Findings
Effective detection of backdoor samples across multiple attack levels.
Outperforms state-of-the-art defense methods in accuracy and efficiency.
Successfully applied to diverse datasets including IMDB, Yelp, and AG News.
Abstract
Deep neural networks (DNNs) and natural language processing (NLP) systems have developed rapidly and have been widely used in various real-world fields. However, they have been shown to be vulnerable to backdoor attacks. Specifically, the adversary injects a backdoor into the model during the training phase, so that input samples with backdoor triggers are classified as the target class. Some attacks have achieved high attack success rates on the pre-trained language models (LMs), but there have yet to be effective defense methods. In this work, we propose a defense method based on deep model mutation testing. Our main justification is that backdoor samples are much more robust than clean samples if we impose random mutations on the LMs and that backdoors are generalizable. We first confirm the effectiveness of model mutation testing in detecting backdoor samples and select the most…
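The core intuition in the abstract, that backdoor samples keep their predicted label under random model mutations more often than clean samples do, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy linear classifier stands in for a language model, Gaussian weight noise stands in for whatever mutation operators the paper actually selects, and the names `gaussian_fuzz`, `label_change_rate`, `looks_like_backdoor`, and the threshold value are hypothetical rather than taken from the paper.

```python
import random


def predict(weights, x):
    """Return the index of the class with the highest linear score."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


def gaussian_fuzz(weights, sigma, rng):
    """Return a mutated copy of the weights with Gaussian noise added.

    Assumption: Gaussian noise is one plausible mutation operator; the
    excerpt does not say which operators the paper uses.
    """
    return [[w + rng.gauss(0.0, sigma) for w in row] for row in weights]


def label_change_rate(weights, x, n_mutants=200, sigma=0.5, seed=0):
    """Fraction of mutated models whose prediction differs from the original."""
    rng = random.Random(seed)
    original = predict(weights, x)
    changed = sum(
        predict(gaussian_fuzz(weights, sigma, rng), x) != original
        for _ in range(n_mutants)
    )
    return changed / n_mutants


def looks_like_backdoor(weights, x, threshold=0.05, **kwargs):
    """Flag a sample whose prediction is unusually stable under mutation.

    The threshold is a hypothetical value chosen for this toy example.
    """
    return label_change_rate(weights, x, **kwargs) < threshold


# Toy 2-class model over 3 features. The third feature plays the role of a
# trigger: a large weight pushes any input containing it toward class 1.
TOY_WEIGHTS = [[1.0, 0.0, 0.0], [0.0, 1.0, 10.0]]
clean_sample = [1.0, 0.9, 0.0]      # sits close to the decision boundary
triggered_sample = [1.0, 0.9, 1.0]  # trigger feature dominates the score

if __name__ == "__main__":
    print("clean LCR:    ", label_change_rate(TOY_WEIGHTS, clean_sample))
    print("triggered LCR:", label_change_rate(TOY_WEIGHTS, triggered_sample))
```

In this toy setting the triggered sample has a much larger decision margin, so its label change rate comes out lower than that of the clean sample. A real detector would mutate an actual language model and would calibrate its decision rule on held-out clean data.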
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
