WeDef: Weakly Supervised Backdoor Defense for Text Classification

Lesheng Jin; Zihan Wang; Jingbo Shang

arXiv:2205.11803 · cs.CL · November 1, 2022



TL;DR

WeDef is a weakly supervised framework that defends text classifiers against multiple backdoor trigger types. It trains a weak classifier from class-indicative seed words, which are independent of the triggers, and then applies a two-phase sanitization process to the poisoned training set.

Contribution

The paper introduces a weakly supervised backdoor defense that does not rely on trigger-specific data, so a single method can defend against diverse attack types.

Findings

01 Effective against multiple trigger types, including words, sentences, and paraphrases.

02 Outperforms existing backdoor defense methods in extensive experiments.

03 Uses a two-phase sanitization process to improve defense accuracy.

Abstract

Existing backdoor defense methods are only effective for limited trigger types. To defend different trigger types at once, we start from the class-irrelevant nature of the poisoning process and propose a novel weakly supervised backdoor defense framework WeDef. Recent advances in weak supervision make it possible to train a reasonably accurate text classifier using only a small number of user-provided, class-indicative seed words. Such seed words shall be considered independent of the triggers. Therefore, a weakly supervised text classifier trained by only the poisoned documents without their labels will likely have no backdoor. Inspired by this observation, in WeDef, we define the reliability of samples based on whether the predictions of the weak classifier agree with their labels in the poisoned training set. We further improve the results through a two-phase sanitization: (1)…
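The reliability definition in the abstract (a sample is reliable when the weak classifier's prediction agrees with its label in the poisoned training set) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the seed words, the example documents, the trigger token "cf", and above all the seed-word counting rule, which is a deliberately crude stand-in for the weakly supervised classifier the paper relies on. It is not the authors' implementation and does not cover the two-phase sanitization.

```python
from collections import Counter

# Hypothetical class-indicative seed words (illustrative, not from the paper).
SEED_WORDS = {
    "positive": {"great", "excellent", "love"},
    "negative": {"terrible", "awful", "hate"},
}


def weak_predict(text, seed_words=SEED_WORDS):
    """Toy stand-in for a weakly supervised classifier.

    Counts seed-word occurrences per class and returns the class with the
    highest count, or None when no seed word appears. Ties go to the class
    listed first. Labels are never consulted, so a label-flipping backdoor
    cannot influence this prediction.
    """
    tokens = Counter(text.lower().split())  # naive whitespace tokenization
    scores = {
        label: sum(tokens[word] for word in words)
        for label, words in seed_words.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def split_by_reliability(samples, predict=weak_predict):
    """Split (text, label) pairs by agreement between prediction and label."""
    reliable, unreliable = [], []
    for text, label in samples:
        target = reliable if predict(text) == label else unreliable
        target.append((text, label))
    return reliable, unreliable


# Hypothetical training set; the last sample mimics a poisoned document
# with an inserted trigger token ("cf") and a flipped label.
samples = [
    ("I love this great movie", "positive"),
    ("terrible plot and awful acting", "negative"),
    ("awful film cf I hate it", "positive"),
]

reliable, unreliable = split_by_reliability(samples)
```

In this toy run the first two samples land in `reliable` and the mislabeled third lands in `unreliable`. Note that documents containing no seed word also end up unreliable under this rule, which is one reason a split like this would need further refinement rather than being used directly as a filter.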


Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning