Beating Attackers At Their Own Games: Adversarial Example Detection Using Adversarial Gradient Directions
Yuhang Wu, Sunpreet S. Arora, Yanhong Wu, Hao Yang

TL;DR
This paper introduces an adversarial example detection method that leverages the directions of adversarial gradients. It reports over 97% AUC-ROC on CIFAR-10 and over 98% on ImageNet, and is evaluated across multiple attack types.
Contribution
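The paper proposes a detection approach based on adversarial gradient directions. It applies only a single random perturbation to the input example, which makes it more efficient than existing methods that rely on multiple perturbations, and it outperforms several state-of-the-art detection methods.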
Findings
Achieves over 97% AUC-ROC on CIFAR-10
Achieves over 98% AUC-ROC on ImageNet
Outperforms several state-of-the-art detection methods
Abstract
Adversarial examples are input examples that are specifically crafted to deceive machine learning classifiers. State-of-the-art adversarial example detection methods characterize an input example as adversarial either by quantifying the magnitude of feature variations under multiple perturbations or by measuring its distance from the estimated benign example distribution. Instead of using such metrics, the proposed method is based on the observation that the directions of adversarial gradients when crafting (new) adversarial examples play a key role in characterizing the adversarial space. Compared to detection methods that use multiple perturbations, the proposed method is efficient as it only applies a single random perturbation on the input example. Experiments conducted on two different databases, CIFAR-10 and ImageNet, show that the proposed detection method achieves, respectively,…
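To make the two ingredients named in the abstract concrete (an adversarial gradient direction and a single random perturbation), here is a minimal sketch in plain Python. It is not the authors' detector: the toy linear softmax classifier, the Gaussian noise model, the `sigma` value, and every function name below are assumptions chosen for illustration. It only shows how one might compare the input-gradient direction of an example with that of one randomly perturbed copy.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def logits(W, b, x):
    """Logits of a toy linear classifier: W x + b (hypothetical model)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]


def input_gradient(W, b, x, label):
    """Gradient of the cross-entropy loss w.r.t. x: W^T (p - onehot(label)).

    An attacker following this gradient increases the loss for `label`,
    so its direction serves here as the 'adversarial gradient direction'.
    """
    p = softmax(logits(W, b, x))
    delta = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
    return [sum(delta[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * c for a, c in zip(u, v)) / (nu * nv)


def gradient_direction_similarity(W, b, x, sigma=0.05, rng=None):
    """Compare gradient directions at x and at ONE randomly perturbed copy.

    Assumption: Gaussian noise stands in for the 'single random
    perturbation'; the abstract does not specify the perturbation used.
    """
    rng = rng or random.Random(0)
    z = logits(W, b, x)
    label = max(range(len(z)), key=lambda k: z[k])  # predicted class of x
    x_pert = [v + rng.gauss(0.0, sigma) for v in x]
    g = input_gradient(W, b, x, label)
    g_pert = input_gradient(W, b, x_pert, label)
    return cosine(g, g_pert)


# Toy 3-class model on 4-dimensional inputs (made-up numbers).
W = [[1.0, 0.0, 0.5, -0.5],
     [0.0, 1.0, -0.5, 0.5],
     [-1.0, -1.0, 0.2, 0.1]]
b = [0.0, 0.0, 0.0]
x = [0.3, -0.2, 0.8, 0.1]
score = gradient_direction_similarity(W, b, x)
```

A similarity score like this could be one feature passed to a downstream benign-vs-adversarial classifier. How the paper actually builds and combines its features is not stated in the text above, so treat that pipeline as open.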
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Anomaly Detection Techniques and Applications
