TL;DR
This paper presents a black-box, population-based optimization method for generating natural language adversarial examples that stay semantically similar to the original text yet fool sentiment analysis and textual entailment models with success rates of 97% and 70%, respectively, highlighting how fragile current NLP models remain.
Contribution
Introduces a black-box, population-based optimization approach that generates adversarial text which is semantically and syntactically similar to the original input and still causes well-trained sentiment analysis and textual entailment models to misclassify.
Findings
97% success rate on sentiment analysis models
70% success rate on textual entailment models
92.3% of successful sentiment analysis adversarial examples are still classified to their original label by human annotators
Abstract
Deep neural networks (DNNs) are vulnerable to adversarial examples, perturbations to correctly classified examples which can cause the model to misclassify. In the image domain, these perturbations are often virtually indistinguishable to human perception, causing humans and state-of-the-art models to disagree. However, in the natural language domain, small perturbations are clearly perceptible, and the replacement of a single word can drastically alter the semantics of the document. Given these challenges, we use a black-box population-based optimization algorithm to generate semantically and syntactically similar adversarial examples that fool well-trained sentiment analysis and textual entailment models with success rates of 97% and 70%, respectively. We additionally demonstrate that 92.3% of the successful sentiment analysis adversarial examples are classified to their original…
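The abstract's "black-box population-based optimization" can be illustrated with a toy sketch. This is not the authors' implementation: the synonym table, the word weights, the stand-in model `positive_prob`, and the helpers `mutate`, `crossover`, and `attack` are all hypothetical names invented for illustration. The sketch only shows the general idea of evolving word substitutions while querying the model for nothing but its output score.

```python
import math
import random

# Hypothetical synonym table (assumption): a real attack would choose
# candidates that preserve meaning and grammar, not a hand-written list.
SYNONYMS = {
    "love": ["like", "enjoy"],
    "wonderful": ["okay", "passable"],
    "great": ["fine", "decent"],
}

# Hypothetical word weights for the stand-in victim model (assumption).
WEIGHTS = {"love": 2.0, "wonderful": 2.0, "great": 2.0,
           "like": 0.3, "enjoy": 0.5, "okay": -0.8, "passable": -1.0,
           "fine": 0.2, "decent": 0.1}


def positive_prob(words):
    """Stand-in black-box model: the attack only sees this output score."""
    score = sum(WEIGHTS.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-score))


def mutate(words, original, rng):
    """Replace one random position with a synonym of the original word there."""
    slots = [i for i, w in enumerate(original) if w in SYNONYMS]
    if not slots:
        return list(words)
    i = rng.choice(slots)
    child = list(words)
    child[i] = rng.choice(SYNONYMS[original[i]])
    return child


def crossover(a, b, rng):
    """Build a child by taking each word from one of the two parents."""
    return [rng.choice(pair) for pair in zip(a, b)]


def attack(original, pop_size=8, generations=30, seed=0):
    """Search for a word-substituted variant that the model labels negative."""
    rng = random.Random(seed)
    population = [mutate(original, original, rng) for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness is the model's probability of the target (negative) label.
        fitness = [1.0 - positive_prob(p) for p in population]
        best = population[fitness.index(max(fitness))]
        if positive_prob(best) < 0.5:
            return best  # prediction flipped: the attack succeeded
        children = [best]  # elitism: carry the best candidate forward
        while len(children) < pop_size:
            # Fitter candidates are more likely to be picked as parents.
            a, b = rng.choices(population, weights=fitness, k=2)
            children.append(mutate(crossover(a, b, rng), original, rng))
        population = children
    return None  # no adversarial example found within the budget
```

Calling `attack("i love this wonderful and great movie".split())` searches for a variant whose `positive_prob` falls below 0.5. Every substitution is drawn from the synonyms of the original word at that position, which is a crude proxy for the semantic and syntactic similarity the abstract requires.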
