TextGuard: Provable Defense against Backdoor Attacks on Text Classification
Hengzhi Pei, Jinyuan Jia, Wenbo Guo, Bo Li, Dawn Song

TL;DR
TextGuard is a provable defense against backdoor attacks on text classification models. It partitions the training data into sub-training sets, trains a base classifier on each, and combines their predictions in an ensemble, achieving higher certified accuracy than existing defenses.
Contribution
It introduces the first provable defense against backdoor attacks on text classification, combining data partitioning and ensemble methods with theoretical security guarantees.
Findings
TextGuard achieves higher certified accuracy than existing defenses.
It remains effective against multiple backdoor attack strategies.
The theoretical analysis guarantees robustness when the length of the backdoor trigger is within a threshold.
Abstract
Backdoor attacks have become a major security threat for deploying machine learning models in security-critical applications. Existing research endeavors have proposed many defenses against backdoor attacks. Despite demonstrating certain empirical defense efficacy, none of these techniques could provide a formal and provable security guarantee against arbitrary attacks. As a result, they can be easily broken by strong adaptive attacks, as shown in our evaluation. In this work, we propose TextGuard, the first provable defense against backdoor attacks on text classification. In particular, TextGuard first divides the (backdoored) training data into sub-training sets, achieved by splitting each training sentence into sub-sentences. This partitioning ensures that a majority of the sub-training sets do not contain the backdoor trigger. Subsequently, a base classifier is trained from each…
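The partition-and-ensemble idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that words are assigned to groups by a deterministic hash, that base classifiers are plain callables, and that a prediction is treated as certified when the vote gap exceeds twice the number of groups a trigger could touch. All function names here are hypothetical.

```python
import hashlib
from collections import Counter


def group_of(word: str, m: int) -> int:
    """Map a word to one of m groups with a deterministic hash (assumed scheme)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % m


def split_sentence(sentence: str, m: int) -> list[str]:
    """Split a sentence into m sub-sentences, keeping word order within each."""
    parts = [[] for _ in range(m)]
    for word in sentence.split():
        parts[group_of(word, m)].append(word)
    # Groups that receive no word become empty strings in this sketch.
    return [" ".join(p) for p in parts]


def build_sub_training_sets(data, m):
    """Turn (sentence, label) pairs into m sub-training sets, one per group."""
    subsets = [[] for _ in range(m)]
    for sentence, label in data:
        for i, sub in enumerate(split_sentence(sentence, m)):
            subsets[i].append((sub, label))
    return subsets


def ensemble_predict(base_classifiers, sentence):
    """Majority vote over base classifiers, each seeing only its own sub-sentence."""
    m = len(base_classifiers)
    subs = split_sentence(sentence, m)
    votes = Counter(clf(sub) for clf, sub in zip(base_classifiers, subs))
    # Ties are broken toward the smallest label so the result is deterministic.
    label = max(sorted(votes), key=votes.get)
    return label, votes


def vote_margin_certifies(votes, t):
    """Simplified check: if at most t base classifiers can be affected,
    each flipped vote shrinks the gap by at most 2, so a gap above 2*t
    cannot be overturned."""
    counts = sorted(votes.values(), reverse=True)
    top = counts[0]
    runner_up = counts[1] if len(counts) > 1 else 0
    return top - runner_up > 2 * t
```

Because the assignment is deterministic, a given trigger word always lands in the same group, so a trigger made of t distinct words can reach at most t of the m sub-training sets. That is the intuition behind the trigger-length threshold. The exact certification condition in the paper may differ from the simplified margin check used here.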
Taxonomy
Topics: Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning · Web Application Security Vulnerabilities
Methods: Balanced Selection
