Don't sweat the small stuff, classify the rest: Sample Shielding to protect text classifiers against adversarial attacks
Jonathan Rusert, Padmini Srinivasan

TL;DR
This paper introduces Sample Shielding, a simple, classifier-agnostic defense method that enhances the robustness of deep learning text classifiers against minimal-change adversarial attacks without sacrificing original accuracy.
Contribution
It proposes a novel, easy-to-implement sampling-based defense strategy that significantly reduces attack success rates across multiple classifiers and datasets.
Findings
Attack success rate drops to 10% or lower with shielding
Sample Shielding maintains high accuracy on original texts
Effective against state-of-the-art minimal-change attacks
Abstract
Deep learning (DL) is used extensively for text classification. However, researchers have demonstrated that such classifiers are vulnerable to adversarial attacks. Attackers modify the text in a way that misleads the classifier while keeping the original meaning close to intact. State-of-the-art (SOTA) attack algorithms follow the general principle of making minimal changes to the text so as not to jeopardize semantics. Taking advantage of this, we propose a novel and intuitive defense strategy called Sample Shielding. It is attacker and classifier agnostic, does not require any reconfiguration of the classifier or external resources, and is simple to implement. Essentially, we sample subsets of the input text, classify them, and summarize these into a final decision. We shield three popular DL text classifiers with Sample Shielding, test their resilience against four SOTA…
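The sample-classify-summarize idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_shield`, the parameters `k` (number of samples) and `p` (fraction of words kept), the choice of word-level sampling, and the use of a majority vote as the summarizing step are all assumptions made here for the example. The `classify` argument stands in for any existing classifier that maps a string to a label.

```python
import random
from collections import Counter
from typing import Callable, Hashable, Optional


def sample_shield(
    text: str,
    classify: Callable[[str], Hashable],
    k: int = 5,            # hypothetical: number of sampled subsets
    p: float = 0.5,        # hypothetical: fraction of words kept per subset
    seed: Optional[int] = None,
) -> Hashable:
    """Classify k random word subsets of `text` and return the majority label."""
    rng = random.Random(seed)
    words = text.split()
    if not words:
        # Nothing to sample from; fall back to the unshielded classifier.
        return classify(text)

    # Keep at least one word and never more than the text contains.
    n_keep = min(len(words), max(1, round(p * len(words))))

    votes = []
    for _ in range(k):
        # Sample word positions without replacement, then restore word order.
        kept = sorted(rng.sample(range(len(words)), n_keep))
        votes.append(classify(" ".join(words[i] for i in kept)))

    # Assumed summarizing step: majority vote (ties go to the label seen first).
    return Counter(votes).most_common(1)[0][0]
```

The intuition is that an attack which changes only a few words will be absent from, or diluted in, many of the sampled subsets, so the perturbed words have less influence on the aggregated decision. An odd `k` avoids ties in the two-class case. Because the wrapper only calls `classify`, the underlying model needs no retraining or reconfiguration.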
Taxonomy
Topics: Adversarial Robustness in Machine Learning
