Universal Adversarial Triggers for Attacking and Analyzing NLP

Eric Wallace; Shi Feng; Nikhil Kandpal; Matt Gardner; Sameer Singh

arXiv:1908.07125 · cs.CL · January 5, 2021

TL;DR

This paper introduces universal adversarial triggers: input-agnostic token sequences that, when concatenated to any input from a dataset, push an NLP model toward a specific prediction. The triggers expose model vulnerabilities and biases and offer insight into model behavior.

Contribution

The paper proposes a gradient-guided search over tokens that finds short universal triggers, uses them both to attack NLP models and to analyze their behavior, and shows that the triggers transfer to other models.
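To make "gradient-guided search over tokens" more concrete, here is a minimal sketch of one candidate-selection step. It assumes a first-order (linear) approximation of the loss around the current trigger embedding, which is a common way to rank token swaps; it is not taken from the official repository, and the names `top_k_candidates`, `embedding_matrix`, and `avg_grad` are hypothetical. The loss is assumed to be one the attacker wants to minimize, such as cross-entropy against the target prediction.

```python
import numpy as np


def top_k_candidates(embedding_matrix, avg_grad, k=5):
    """Rank vocabulary tokens as replacements for one trigger position.

    embedding_matrix: (vocab_size, dim) array of token embeddings.
    avg_grad: (dim,) gradient of the loss with respect to the current
        trigger token's embedding, averaged over a batch of inputs
        (assumed to be computed elsewhere, e.g. by backpropagation).
    Returns the ids of the k tokens with the lowest linearized loss.
    """
    # Linearized loss change for swapping in token v: (e_v - e_current) . grad.
    # e_current is identical for every candidate, so ranking by e_v . grad
    # gives the same order.
    scores = embedding_matrix @ avg_grad
    return np.argsort(scores)[:k]


# Toy vocabulary of four 2-dimensional embeddings (illustrative values only).
E = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
g = np.array([1.0, 0.5])
print(top_k_candidates(E, g, k=2))  # [2 3]
```

In a full search, a step like this would typically be repeated for each trigger position, with the top candidates re-scored by running the model, since the linear estimate is only approximate.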

Findings

01

Triggers drastically reduce model accuracy on the targeted task: for example, SNLI entailment accuracy drops from 89.94% to 0.55%.

02

Triggers optimized with white-box access to one model transfer to other models on every task considered.

03

Triggers reveal dataset biases and model heuristics.
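The accuracy finding can be illustrated with a small evaluation sketch: the same trigger is attached to every input and accuracy is recomputed. This is a toy under stated assumptions, not the paper's evaluation code. The function `accuracy_with_trigger`, the classifier `toy_predict`, and the token `zzz` are all hypothetical, and prepending the trigger (rather than placing it elsewhere) is an assumption.

```python
def accuracy_with_trigger(predict, examples, trigger_tokens):
    """Accuracy after prepending the same trigger to every input.

    predict: function mapping a list of tokens to a predicted label.
    examples: list of (tokens, gold_label) pairs.
    trigger_tokens: list of tokens; an empty list gives clean accuracy.
    """
    correct = 0
    for tokens, label in examples:
        attacked = list(trigger_tokens) + list(tokens)  # input-agnostic: same trigger each time
        if predict(attacked) == label:
            correct += 1
    return correct / len(examples)


def toy_predict(tokens):
    # Hypothetical classifier with a planted weakness: the made-up token
    # "zzz" always flips the prediction to "negative".
    return "negative" if "zzz" in tokens else "positive"


examples = [
    (["a", "great", "film"], "positive"),
    (["truly", "wonderful"], "positive"),
]
print(accuracy_with_trigger(toy_predict, examples, []))       # 1.0
print(accuracy_with_trigger(toy_predict, examples, ["zzz"]))  # 0.0
```

Because the trigger does not depend on the individual input, a single short sequence can be evaluated against an entire dataset this way.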

Abstract

Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset. We propose a gradient-guided search over tokens which finds short trigger sequences (e.g., one word for classification and four words for language modeling) that successfully trigger the target prediction. For example, triggers cause SNLI entailment accuracy to drop from 89.94% to 0.55%, 72% of "why" questions in SQuAD to be answered "to kill american people", and the GPT-2 language model to spew racist output even when conditioned on non-racial contexts. Furthermore, although the triggers are optimized using white-box access to a specific model, they transfer to other models for all tasks we consider. Finally,…

Code & Models

Repositories

Eric-Wallace/universal-triggers
PyTorch · Official

Taxonomy

Methods: Linear Layer · Cosine Annealing · Residual Connection · Attention Dropout · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · Byte Pair Encoding · Dense Connections