Identifying Adversarial Attacks on Text Classifiers

Zhouhang Xie; Jonathan Brophy; Adam Noack; Wencong You; Kalyani Asthana; Carter Perkins; Sabrina Reis; Sameer Singh; Daniel Lowd

arXiv:2201.08555 · cs.CL · January 24, 2022 · 5 cites

TL;DR

This paper creates a large dataset of adversarial text attacks, develops classifiers to detect and identify these attacks, and explores features that improve attack forensics for text classifiers.

Contribution

It introduces an extensive dataset of attack instances, benchmarks classifiers for attack detection and identification, and evaluates feature types for attack forensics.

Findings

1. Text properties effectively distinguish attacked texts.
2. Language model features improve attack identification.
3. Target model features reveal attack influence.
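
To make the first finding concrete, here is a minimal sketch of what "text properties" could look like as features for an attack detector. The function name `text_properties` and the specific features (character count, non-ASCII ratio, punctuation ratio, and so on) are assumptions chosen for illustration; they are not the paper's feature set.

```python
import string


def text_properties(text: str) -> dict[str, float]:
    """Compute a few surface-level properties of a text.

    Hypothetical feature set for illustration only; not the paper's features.
    """
    tokens = text.split()
    n_chars = max(len(text), 1)    # guard against division by zero on empty input
    n_tokens = max(len(tokens), 1)
    return {
        "num_chars": float(len(text)),
        "num_tokens": float(len(tokens)),
        # Character-level attacks may swap in look-alike Unicode characters.
        "non_ascii_ratio": sum(ord(c) > 127 for c in text) / n_chars,
        "punct_ratio": sum(c in string.punctuation for c in text) / n_chars,
        "upper_ratio": sum(c.isupper() for c in text) / n_chars,
        "mean_token_len": sum(len(t) for t in tokens) / n_tokens,
    }


# Usage: a clean text versus one with a Cyrillic "o" (U+043E) substituted in.
clean = text_properties("this movie is great")
perturbed = text_properties("this m\u043evie is great")
```

In this toy case the two texts have the same length, but only the perturbed one has a nonzero `non_ascii_ratio`. A detector would typically feed such feature vectors to an ordinary classifier.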

Abstract

The landscape of adversarial attacks against text classifiers continues to grow, with new attacks developed every year and many of them available in standard toolkits, such as TextAttack and OpenAttack. In response, there is a growing body of work on robust learning, which reduces vulnerability to these attacks, though sometimes at a high cost in compute time or accuracy. In this paper, we take an alternate approach: we attempt to understand the attacker by analyzing adversarial text to determine which methods were used to create it. Our first contribution is an extensive dataset for attack detection and labeling: 1.5 million attack instances, generated by twelve adversarial attacks targeting three classifiers trained on six source datasets for sentiment analysis and abuse detection in English. As our second contribution, we use this dataset to develop and benchmark a number of…

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Misinformation and Its Impacts · Terrorism, Counterterrorism, and Political Violence