MaskPure: Improving Defense Against Text Adversaries with Stochastic Purification
Harrison Gietz, Jugal Kalita

TL;DR
MaskPure introduces a stochastic text purification method inspired by diffusion models, significantly enhancing language model robustness against various adversarial attacks without needing attack-specific training or prior attack knowledge.
Contribution
It is the first stochastic purification technique demonstrating effectiveness against both character-level and word-level adversarial attacks in NLP.
Findings
Outperforms or matches existing defenses in robustness.
Requires no adversarial classifier training.
Proven to be certifiably robust.
Abstract
Improving language model robustness, including successful defense against adversarial attacks, remains an open problem. In computer vision settings, the stochastic noising and de-noising process provided by diffusion models has proven useful for purifying input images, thus improving model robustness against adversarial attacks. Similarly, some initial work has explored the use of random noising and de-noising to mitigate adversarial attacks in an NLP setting, but the quality and efficiency of these methods must improve for them to remain competitive. We extend input text purification methods inspired by diffusion processes, which randomly mask and refill portions of the input text before classification. Our novel method, MaskPure, matches or exceeds the robustness of other contemporary defenses, while also requiring no adversarial classifier…
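The mask-and-refill idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `mask_tokens`, `purify_and_classify`, `toy_fill`, and `toy_classify` are hypothetical, the fill and classify functions are toy stand-ins for a masked language model and a trained classifier, and combining several purified copies by majority vote is an assumption made here for illustration.

```python
import random
from collections import Counter

MASK = "[MASK]"  # placeholder mask token (assumed for this sketch)


def mask_tokens(tokens, mask_rate, rng):
    """Replace a random subset of tokens with MASK (at least one if non-empty)."""
    if not tokens:
        return []
    k = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    masked_positions = set(rng.sample(range(len(tokens)), k))
    return [MASK if i in masked_positions else t for i, t in enumerate(tokens)]


def purify_and_classify(text, fill_fn, classify_fn,
                        mask_rate=0.3, num_copies=5, seed=0):
    """Mask and refill several copies of the input, then majority-vote the labels.

    fill_fn: takes a token list containing MASK entries, returns a refilled list.
    classify_fn: takes a string, returns a label.
    Both are supplied by the caller; the voting step is an assumption of this sketch.
    """
    rng = random.Random(seed)
    tokens = text.split()
    votes = []
    for _ in range(num_copies):
        masked = mask_tokens(tokens, mask_rate, rng)
        refilled = fill_fn(masked)
        votes.append(classify_fn(" ".join(refilled)))
    return Counter(votes).most_common(1)[0][0]


def toy_fill(masked_tokens):
    """Hypothetical stand-in for a masked language model: fills masks with 'the'."""
    return ["the" if t == MASK else t for t in masked_tokens]


def toy_classify(text):
    """Hypothetical stand-in for a sentiment classifier."""
    return "positive" if "good" in text.split() else "negative"


if __name__ == "__main__":
    print(purify_and_classify("a terrible boring film", toy_fill, toy_classify))
```

In a realistic setting, `fill_fn` would wrap a masked language model and `classify_fn` the classifier being defended; the point of the sketch is only that the classifier sees randomly rewritten copies of the input rather than the possibly perturbed original.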
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Advanced Malware Detection Techniques
Methods: Diffusion
