NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as Artificial Adversaries?
Saadia Gabriel, Hamid Palangi, Yejin Choi

TL;DR
This paper introduces NaturalAdversaries, a two-stage framework for generating natural-looking adversarial examples in NLP that effectively fool classifiers and resemble real-world data, aiding robustness research.
Contribution
The paper presents a novel two-stage adversarial generation method that produces realistic adversarial examples applicable in both black-box and white-box settings.
Findings
Adversaries generalize across domains.
NaturalAdversaries produce more realistic failure cases.
Framework offers insights for improving model robustness.
Abstract
While a substantial body of prior work has explored adversarial example generation for natural language understanding tasks, these examples are often unrealistic and diverge from real-world data distributions. In this work, we introduce a two-stage adversarial example generation framework, NaturalAdversaries, for designing adversaries that are effective at fooling a given classifier and that demonstrate natural-looking failure cases that could plausibly occur during in-the-wild deployment of the models. In the first stage, a token attribution method is used to summarize a given classifier's behaviour as a function of the key tokens in the input. In the second stage, a generative model is conditioned on the key tokens from the first stage. NaturalAdversaries is adaptable to both black-box and white-box adversarial attacks, depending on the level of access to the model parameters. Our results…
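The two stages described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation, and every name in it is hypothetical: `toy_classifier` is a keyword scorer standing in for the classifier under attack, leave-one-out occlusion is swapped in as a simple black-box attribution method (the abstract does not say which attribution method is used), and a pair of fixed templates stands in for the generative model that is conditioned on the key tokens.

```python
from typing import Callable, List

# Hypothetical stand-in for the classifier under attack: a keyword scorer
# that returns a pseudo-probability for the "negative" class.
WEIGHTS = {"hate": 0.5, "stupid": 0.3}


def toy_classifier(text: str) -> float:
    return 0.1 + sum(WEIGHTS.get(tok, 0.0) for tok in text.lower().split())


def key_tokens(text: str, classify: Callable[[str], float], k: int = 2) -> List[str]:
    """Stage 1 (assumed black-box variant): rank tokens by how much the
    classifier score drops when each token is removed from the input."""
    tokens = text.split()
    base = classify(text)
    scored = []
    for i, tok in enumerate(tokens):
        occluded = " ".join(tokens[:i] + tokens[i + 1:])
        scored.append((tok, base - classify(occluded)))
    # Largest score drop first; the sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [tok for tok, _ in scored[:k]]


# Hypothetical stand-in for the generative model: fixed templates that are
# "conditioned" on the key tokens by slotting them into new contexts.
TEMPLATES = [
    "nobody should {keys} on a day like this",
    "the sign said {keys} in big letters",
]


def generate_candidates(keys: List[str]) -> List[str]:
    """Stage 2: produce new candidate inputs that contain the key tokens."""
    joined = " ".join(keys)
    return [template.format(keys=joined) for template in TEMPLATES]


keys = key_tokens("i hate this stupid movie", toy_classifier)
candidates = generate_candidates(keys)
for candidate in candidates:
    # Score each candidate with the same classifier that was attributed.
    print(f"{toy_classifier(candidate):.2f}  {candidate}")
```

Occlusion needs only the classifier's output scores, which is why it fits the black-box setting; with access to model parameters, a gradient-based attribution method could take its place in `key_tokens` without changing the second stage.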
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Natural Language Processing Techniques · Topic Modeling
