"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks

Edoardo Mosca; Shreyash Agarwal; Javier Rando; Georg Groh

arXiv:2204.04636·cs.AI·June 30, 2023

TL;DR

This paper introduces a model-agnostic detector that analyzes how a classifier's logits vary when the input text is perturbed in order to identify adversarial inputs, improving on state-of-the-art detection performance and generalizing across models, datasets, and word-level attacks.

Contribution

It presents a novel logits-based detection method for NLP adversarial attacks that outperforms existing techniques and generalizes well across various models and datasets.

Findings

1. Improves state-of-the-art adversarial detection accuracy
2. Demonstrates strong cross-model and cross-dataset generalization
3. Effective against multiple word-level attack types

Abstract

Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks.
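
The abstract's core idea, watching how the classifier's logits react when the input text is perturbed, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper or its repository: the perturbation is replacing one token at a time with a placeholder, the reaction score is the predicted-class logit minus the largest other logit, and `toy_classifier`, `logit_reactions`, and the `[UNK]` mask token are hypothetical names. The paper's exact perturbation and scoring may differ.

```python
from typing import Callable, List, Sequence, Tuple

# Any function mapping a token list to per-class logits (hypothetical interface).
Classifier = Callable[[List[str]], Sequence[float]]


def logit_reactions(
    tokens: List[str],
    classify: Classifier,
    mask_token: str = "[UNK]",  # assumed placeholder token
) -> Tuple[int, List[float]]:
    """Mask each token in turn and record the resulting logit margin.

    Returns the class predicted on the unperturbed input and, per token,
    the predicted-class logit minus the best competing logit after that
    token is masked. A low or negative value means the prediction leans
    heavily on that single token.
    """
    base = classify(tokens)
    predicted = max(range(len(base)), key=lambda c: base[c])

    reactions = []
    for i in range(len(tokens)):
        perturbed = tokens[:i] + [mask_token] + tokens[i + 1:]
        logits = classify(perturbed)
        runner_up = max(l for c, l in enumerate(logits) if c != predicted)
        reactions.append(logits[predicted] - runner_up)
    return predicted, reactions


def toy_classifier(tokens: List[str]) -> List[float]:
    """Stand-in sentiment model with a tiny, made-up lexicon."""
    weights = {"great": 2.0, "good": 1.0, "bad": -1.0, "awful": -2.0}
    score = sum(weights.get(t, 0.0) for t in tokens)
    return [-score, score]  # logits for [negative, positive]


predicted, reactions = logit_reactions(["a", "great", "movie"], toy_classifier)
print(predicted, reactions)
```

In this toy case the margin collapses only when "great" is masked. A detector in the spirit of the abstract would then look for patterns in such per-token reactions, for example by feeding them as features to a separate binary classifier that labels inputs as adversarial or not.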

Code & Models

Repositories

javirandor/wdr (PyTorch, official)
