Input-specific Attention Subnetworks for Adversarial Detection
Emil Biju, Anirudh Sriram, Pratyush Kumar, Mitesh M Khapra

TL;DR
This paper constructs input-specific attention subnetworks from Transformer models and uses them to detect adversarial inputs, improving state-of-the-art detection accuracy for the BERT encoder by over 7.5% across 10 NLU datasets and 11 adversarial attack types.
Contribution
The paper proposes a method that uses attention heads for adversarial detection, achieving state-of-the-art detection accuracy and performing well even with modest training sets of adversarial examples.
Findings
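Over 7.5% improvement on the state-of-the-art adversarial detection accuracy for the BERT encoder, across 10 NLU datasets and 11 adversarial attack types.
Detection is more accurate for larger models, which are likely to have more spurious correlations.
The detector performs well even with modest training sets of adversarial examples.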
Abstract
Self-attention heads are characteristic of Transformer models and have been well studied for interpretability and pruning. In this work, we demonstrate an altogether different utility of attention heads, namely for adversarial detection. Specifically, we propose a method to construct input-specific attention subnetworks (IAS) from which we extract three features to discriminate between authentic and adversarial inputs. The resultant detector significantly improves (by over 7.5%) the state-of-the-art adversarial detection accuracy for the BERT encoder on 10 NLU datasets with 11 different adversarial attack types. We also demonstrate that our method (a) is more accurate for larger models which are likely to have more spurious correlations and thus vulnerable to adversarial attack, and (b) performs well even with modest training sets of adversarial examples.
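The abstract does not say how a subnetwork is built or which three features are extracted, so the following is only a minimal sketch of the general idea of discriminating inputs by which attention heads stay active. It assumes each input already comes with a binary head mask (layers x heads), which is a hypothetical representation, and it swaps in a plain nearest-centroid rule over that single feature in place of the paper's detector. All names (`flatten_mask`, `CentroidDetector`) and the toy masks are illustrative, not taken from the paper.

```python
from typing import List, Sequence

Mask2D = Sequence[Sequence[int]]  # assumed shape: layers x heads, entries 0 or 1


def flatten_mask(mask: Mask2D) -> List[int]:
    """Flatten a layers x heads binary mask into one feature vector."""
    return [int(bit) for layer in mask for bit in layer]


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CentroidDetector:
    """Nearest-centroid stand-in for a detector (an assumption, not the paper's classifier)."""

    def fit(self, authentic: Sequence[Mask2D], adversarial: Sequence[Mask2D]) -> "CentroidDetector":
        self.authentic_centroid = centroid([flatten_mask(m) for m in authentic])
        self.adversarial_centroid = centroid([flatten_mask(m) for m in adversarial])
        return self

    def is_adversarial(self, mask: Mask2D) -> bool:
        features = flatten_mask(mask)
        to_adversarial = squared_distance(features, self.adversarial_centroid)
        to_authentic = squared_distance(features, self.authentic_centroid)
        return to_adversarial < to_authentic


# Toy masks for a hypothetical 2-layer, 3-head model (made-up values).
AUTHENTIC_MASKS = [[[1, 1, 0], [0, 1, 1]], [[1, 1, 0], [0, 1, 0]]]
ADVERSARIAL_MASKS = [[[0, 0, 1], [1, 0, 0]], [[0, 1, 1], [1, 0, 0]]]

detector = CentroidDetector().fit(AUTHENTIC_MASKS, ADVERSARIAL_MASKS)
```

Calling `detector.is_adversarial(mask)` on a new mask returns `True` when the mask lies closer to the adversarial centroid. A real detector would presumably use a learned classifier over richer features; the centroid rule is chosen here only because it keeps the example dependency-free.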
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications · Explainable Artificial Intelligence (XAI)
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Layer Normalization · Adam · Absolute Position Encodings
