Model Robustness with Text Classification: Semantic-preserving adversarial attacks
Rahul Singh, Tarun Joshi, Vijayan N. Nair, and Agus Sudjianto

TL;DR
This paper introduces algorithms for generating semantic-preserving adversarial attacks on text classification models, enabling robustness assessment in both white-box and black-box scenarios, including transformer-based models.
Contribution
The paper presents novel algorithms for semantic-preserving adversarial attacks applicable to both white-box and black-box text classification models, including transformers.
Findings
White-box attacks cause significant decision flips.
Black-box attacks effectively reverse transformer decisions.
Attacks preserve semantics and syntax of original text.
Abstract
We propose algorithms for creating adversarial attacks that assess model robustness in text classification problems. The algorithms can generate both white-box and black-box attacks while preserving the semantics and syntax of the original text. The attacks cause a significant number of decision flips in the white-box setting, and the same rule-based attacks can be used in the black-box setting, where they are able to reverse the decisions of transformer-based architectures.
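To make the idea of a rule-based, semantic-preserving black-box attack concrete, here is a minimal sketch. It is not the authors' algorithm, which this summary does not describe. The synonym table, the keyword-based `toy_classifier`, and the `attack` and `flip_rate` helpers are all hypothetical stand-ins chosen for illustration. The attack only calls the model's prediction function (the black-box assumption), swaps one word at a time for a listed synonym, and reports whether the decision flips.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical substitution rules. A real attack would draw candidates from a
# thesaurus or embedding neighbours and check that syntax is preserved.
SYNONYMS: Dict[str, List[str]] = {
    "good": ["decent", "fine"],
    "great": ["remarkable"],
    "bad": ["poor", "subpar"],
    "terrible": ["dreadful"],
}

# Vocabulary of the stand-in model (an assumption, not a real classifier).
POSITIVE = {"good", "great", "fine"}
NEGATIVE = {"bad", "terrible", "poor"}


def toy_classifier(text: str) -> int:
    """Stand-in black-box model: returns 1 (positive) or 0 (negative)."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return 1 if score > 0 else 0


def attack(text: str, predict: Callable[[str], int]) -> Optional[str]:
    """Try single-word synonym swaps; return the first text that flips the label."""
    words = text.split()
    original_label = predict(text)
    for i, word in enumerate(words):
        for synonym in SYNONYMS.get(word.lower(), []):
            candidate = " ".join(words[:i] + [synonym] + words[i + 1:])
            if predict(candidate) != original_label:
                return candidate
    return None  # no single substitution changed the decision


def flip_rate(texts: List[str], predict: Callable[[str], int]) -> float:
    """Fraction of inputs for which the attack finds a label-flipping rewrite."""
    flipped = sum(attack(t, predict) is not None for t in texts)
    return flipped / len(texts)


if __name__ == "__main__":
    samples = ["the food was good", "the food was bad", "a great and good film"]
    for s in samples:
        print(repr(s), "->", repr(attack(s, toy_classifier)))
    print("flip rate:", flip_rate(samples, toy_classifier))
```

The flip rate computed here is one simple way to summarize robustness: the more inputs whose label changes under a meaning-preserving rewrite, the less robust the model. A white-box variant could use model internals such as gradients to decide which words to substitute first, rather than trying every position.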
