WeDef: Weakly Supervised Backdoor Defense for Text Classification
Lesheng Jin, Zihan Wang, Jingbo Shang

TL;DR
WeDef is a novel weakly supervised framework that effectively defends text classifiers against various backdoor triggers by leveraging class-irrelevant seed words and a two-phase sanitization process.
Contribution
It introduces a weakly supervised backdoor defense method that does not rely on trigger-specific data, improving robustness against diverse attack types.
Findings
Effective against multiple trigger types including words, sentences, and paraphrases.
Outperforms existing backdoor defense methods in extensive experiments.
Utilizes a two-phase sanitization process for improved defense accuracy.
Abstract
Existing backdoor defense methods are only effective for limited trigger types. To defend different trigger types at once, we start from the class-irrelevant nature of the poisoning process and propose a novel weakly supervised backdoor defense framework WeDef. Recent advances in weak supervision make it possible to train a reasonably accurate text classifier using only a small number of user-provided, class-indicative seed words. Such seed words shall be considered independent of the triggers. Therefore, a weakly supervised text classifier trained by only the poisoned documents without their labels will likely have no backdoor. Inspired by this observation, in WeDef, we define the reliability of samples based on whether the predictions of the weak classifier agree with their labels in the poisoned training set. We further improve the results through a two-phase sanitization: (1)…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning
