Detecting Textual Adversarial Examples Based on Distributional Characteristics of Data Representations
Na Liu, Mark Dras, Wei Emma Zhang

TL;DR
This paper introduces two reactive detection methods for textual adversarial examples in NLP, based on distributional properties of data representations, achieving state-of-the-art results across multiple attack levels and datasets.
Contribution
The paper proposes two reactive detection techniques based on distributional characteristics of data representations: an adaptation of Local Intrinsic Dimensionality (LID) and a new method, MDRE. Together they fill a gap in reactive adversarial defense for NLP.
Findings
Adapted LID achieves state-of-the-art detection performance.
MDRE outperforms existing baselines on multiple datasets.
Both methods effectively detect various levels of textual adversarial attacks.
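For readers unfamiliar with LID, the following is a minimal sketch of the standard maximum-likelihood LID estimator from the wider literature, not the paper's adapted version. It assumes each text has already been mapped to a fixed-length vector (for example, a hidden state of the classifier); the function name `lid_mle`, the neighbourhood size `k`, and the use of Euclidean distance are illustrative assumptions that this summary does not specify.

```python
import numpy as np


def lid_mle(query, reference, k=20):
    """Maximum-likelihood LID estimate of `query` relative to `reference`.

    Hypothetical helper for illustration only.
    query:     1-D array, the representation of one input.
    reference: 2-D array (n_samples, n_features) of representations,
               e.g. a batch of clean training examples.
    k:         number of nearest neighbours to use (an assumed default).
    """
    query = np.asarray(query, dtype=float)
    reference = np.asarray(reference, dtype=float)

    # Euclidean distances from the query to every reference point.
    dists = np.linalg.norm(reference - query, axis=1)

    # Drop zero distances (exact copies of the query), keep the k smallest.
    dists = np.sort(dists[dists > 0])[:k]
    if dists.size < 2:
        raise ValueError("need at least two distinct neighbours")

    # LID = -1 / mean(log(r_i / r_k)), where r_k is the largest kept distance.
    mean_log_ratio = np.mean(np.log(dists / dists[-1]))
    if mean_log_ratio == 0:
        raise ValueError("all neighbours are equidistant; estimate undefined")
    return -1.0 / mean_log_ratio
```

A detector built on this idea would typically compute such estimates for clean and adversarial examples and train a simple classifier on them; whether the paper follows exactly that recipe is not stated in this summary.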
Abstract
Although deep neural networks have achieved state-of-the-art performance in various machine learning tasks, adversarial examples, constructed by adding small non-random perturbations to correctly classified inputs, successfully fool highly expressive deep classifiers into incorrect predictions. Approaches to adversarial attacks in natural language tasks have boomed in the last five years using character-level, word-level, phrase-level, or sentence-level textual perturbations. While there is some work in NLP on defending against such attacks through proactive methods, like adversarial training, there are, to our knowledge, no effective general reactive approaches to defence via detection of textual adversarial examples such as are found in the image processing literature. In this paper, we propose two new reactive methods for NLP to fill this gap, which unlike the few limited application…
Taxonomy
Topics: Adversarial Robustness in Machine Learning
