Contrasting Human- and Machine-Generated Word-Level Adversarial Examples for Text Classification
Maximilian Mozes, Max Bartolo, Pontus Stenetorp, Bennett Kleinberg, Lewis D. Griffin

TL;DR
This paper compares human- and machine-generated adversarial examples for text classification, showing that humans can efficiently produce natural, sentiment-preserving attacks comparable to those produced by algorithms, with implications for NLP robustness.
Contribution
It introduces a human-in-the-loop approach for generating adversarial examples and compares their quality and efficiency with those of state-of-the-art attack algorithms.
Findings
Humans can generate many valid adversarial examples efficiently.
Human- and algorithm-generated examples are similar in naturalness and sentiment preservation.
Humans are more computationally efficient than the algorithms at creating adversarial examples.
Abstract
Research shows that natural language processing models are generally considered to be vulnerable to adversarial attacks; but recent work has drawn attention to the issue of validating these adversarial inputs against certain criteria (e.g., the preservation of semantics and grammaticality). Enforcing constraints to uphold such criteria may render attacks unsuccessful, raising the question of whether valid attacks are actually feasible. In this work, we investigate this through the lens of human language ability. We report on crowdsourcing studies in which we task humans with iteratively modifying words in an input text, while receiving immediate model feedback, with the aim of causing a sentiment classification model to misclassify the example. Our findings suggest that humans are capable of generating a substantial amount of adversarial examples using semantics-preserving word…
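The abstract describes an interactive loop: a person substitutes one word at a time, sees the model's prediction immediately, and stops once the classifier's label changes. The Python sketch below illustrates that loop under loud assumptions: the lexicon-based `classify` function is a hypothetical stand-in for the trained sentiment model used in the study, the edits are scripted rather than chosen by a crowdworker, and the names `LEXICON`, `substitute`, and `interactive_attack` are invented for illustration. It makes no attempt to check that a substitution preserves meaning or grammaticality, which are exactly the validity criteria the abstract highlights.

```python
from typing import Dict, List, Tuple

# HYPOTHETICAL toy classifier: a tiny sentiment lexicon standing in for the
# trained sentiment model used in the study. Words not listed score 0.
LEXICON: Dict[str, float] = {
    "great": 2.0, "good": 1.0,
    "bad": -1.0, "awful": -2.0, "dull": -1.0,
}

Feedback = Tuple[str, str, float]  # (edited text, predicted label, score)


def classify(tokens: List[str]) -> Tuple[str, float]:
    """Predict a sentiment label by summing lexicon scores."""
    score = sum(LEXICON.get(t.lower(), 0.0) for t in tokens)
    return ("positive" if score > 0 else "negative"), score


def substitute(tokens: List[str], index: int, new_word: str) -> List[str]:
    """Return a copy of tokens with the word at index replaced."""
    edited = list(tokens)
    edited[index] = new_word
    return edited


def interactive_attack(
    tokens: List[str], edits: List[Tuple[int, str]]
) -> Tuple[List[str], List[Feedback], bool]:
    """Apply word substitutions one at a time, recording the model's
    prediction after each edit (the "immediate feedback"), and stop as
    soon as the predicted label differs from the original prediction."""
    original_label, _ = classify(tokens)
    current = list(tokens)
    history: List[Feedback] = []
    for index, new_word in edits:
        current = substitute(current, index, new_word)
        label, score = classify(current)
        history.append((" ".join(current), label, score))
        if label != original_label:
            return current, history, True  # misclassification achieved
    return current, history, False  # edits exhausted without a label flip


if __name__ == "__main__":
    review = "the cast was great and the script was good but the pacing was dull".split()
    # Scripted stand-in for a crowdworker's choices: (word index, replacement).
    final, history, flipped = interactive_attack(review, [(8, "decent"), (3, "superb")])
    for text, label, score in history:
        print(f"{label:8} {score:+.1f}  {text}")
    print("label flipped:", flipped)
```

In the scripted run, swapping "good" for "decent" leaves the toy model positive (score 1.0), and swapping "great" for "superb" then flips it to negative (score -1.0), because the toy lexicon has no entry for either replacement. This mirrors, in miniature, the kind of word substitution the abstract describes, although a real model's blind spots are far less obvious than a missing lexicon entry.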
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
