Graded Suspiciousness of Adversarial Texts to Human
Shakila Mahjabin Tonni, Pedro Faustini, Mark Dras

TL;DR
This paper investigates human perception of adversarial texts, introduces a dataset of human suspiciousness ratings, and develops a model to generate less suspicious adversarial examples for NLP systems.
Contribution
It provides the first dataset of human suspiciousness evaluations for adversarial texts and a regression model to quantify and reduce suspiciousness in adversarial NLP examples.
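The summary does not specify the regression model's inputs or form. The sketch below shows what quantifying *graded* suspiciousness with regression means: the model fits a continuous score rather than a binary suspicious/not-suspicious label. It is a minimal sketch under stated assumptions. It uses a single hypothetical feature (the fraction of words an attack perturbed) and mean human ratings on a hypothetical 1 to 5 scale, fitted with ordinary least squares. The names `fit_linear` and `predict_suspiciousness` and the toy data are illustrative, not the authors' model.

```python
from statistics import mean
from typing import Sequence, Tuple

# Hypothetical stand-in for a suspiciousness regressor, NOT the paper's model.
# Assumed setup: one feature per text (e.g. fraction of perturbed words) and
# a mean human suspiciousness rating per text on an assumed 1-5 scale.

def fit_linear(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for rating = intercept + slope * feature."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature has zero variance")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


def predict_suspiciousness(intercept: float, slope: float, feature: float,
                           lo: float = 1.0, hi: float = 5.0) -> float:
    """Predict a graded score, clamped to the assumed rating scale [lo, hi]."""
    return min(hi, max(lo, intercept + slope * feature))
```

For example, fitting the toy points (0.0, 1.0), (0.1, 2.0), (0.2, 3.0), (0.3, 4.0) gives a slope of about 10 and an intercept of about 1. A text with 15% of its words perturbed would therefore be predicted at about 2.5. A real model would use richer features and human-annotated data.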
Findings
Human suspiciousness correlates with detection difficulty.
The regression model effectively predicts suspiciousness scores.
Incorporating suspiciousness scores improves adversarial text generation.
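The summary does not describe how suspiciousness scores enter the generation process. One plausible approach, assumed here purely for illustration, is to filter and re-rank an attack's candidate perturbations in three steps. First, keep only candidates that fool the victim model. Second, optionally drop those above a suspiciousness threshold. Third, return the least suspicious survivor. `pick_least_suspicious`, the `threshold` parameter, and both scoring callables are hypothetical. In practice `suspiciousness` could be a trained regressor such as the paper's, but this sketch is not the authors' generation method.

```python
from typing import Callable, Iterable, Optional

# Hypothetical re-ranking step for an adversarial attack; NOT the paper's method.

def pick_least_suspicious(
    candidates: Iterable[str],
    fools_victim: Callable[[str], bool],
    suspiciousness: Callable[[str], float],
    threshold: Optional[float] = None,
) -> Optional[str]:
    """Return the successful adversarial candidate with the lowest predicted
    suspiciousness, or None if no candidate qualifies."""
    best: Optional[str] = None
    best_score = float("inf")
    for text in candidates:
        if not fools_victim(text):
            continue  # attack failed on the victim model; discard
        score = suspiciousness(text)
        if threshold is not None and score > threshold:
            continue  # predicted to look too suspicious to a human reader
        if score < best_score:
            best, best_score = text, score
    return best
```

With toy scores of 3.5 for "the film was grate" and 4.0 for "the flim was great", both of which fool the victim, the function returns "the film was grate". With `threshold=3.0` it returns `None`. Ranking by suspiciousness only among successful candidates keeps attack success as a hard constraint and uses human-oriented plausibility as the tie-breaker.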
Abstract
Adversarial examples pose a significant challenge to deep neural networks (DNNs) in both the image and text domains: they are meticulously altered inputs intended to degrade model performance. Adversarial texts, however, differ from adversarial images because they must preserve semantic similarity and because text is discrete. This study examines human suspiciousness, a quality distinct from the imperceptibility that has traditionally been the focus of image-based adversarial examples. Unlike images, where adversarial changes are meant to be indistinguishable to the human eye, textual adversarial content often must remain undetected or unsuspicious to human readers while still deceiving NLP systems or bypassing filters. In this research, we expand the study of human suspiciousness by analyzing how individuals perceive…
Taxonomy
Topics: Digital Media Forensic Detection
Methods: Focus
