Attacks Meet Interpretability: Attribute-steered Detection of Adversarial Samples

Guanhong Tao; Shiqing Ma; Yingqi Liu; Xiangyu Zhang

arXiv:1810.11580 · cs.LG · October 30, 2018 · 52 cites

Open Access · 1 Repo

TL;DR

This paper introduces an interpretability-based method for detecting adversarial samples in face recognition models by analyzing neuron-attribute relationships, achieving high detection accuracy across multiple attack types.

Contribution

It proposes a novel bi-directional inference approach linking attributes and neurons to identify adversarial inputs, improving detection accuracy over existing methods.

Findings

01 94% detection accuracy on 7 attack types

02 9.91% false positive rate on benign inputs

03 Outperforms the feature squeezing technique
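
For readers unfamiliar with how the two headline figures relate, the sketch below shows one conventional way to compute them. It assumes "detection accuracy" means the fraction of adversarial inputs the detector flags and "false positive rate" means the fraction of benign inputs it flags; the function name and the sample data are hypothetical and are not taken from the paper.

```python
def detection_metrics(is_adversarial, flagged):
    """Return (detection_rate, false_positive_rate) for a detector.

    is_adversarial: list of bools, ground truth per input (hypothetical data).
    flagged: list of bools, True where the detector raised an alarm.
    A rate is None when its denominator would be zero.
    """
    if len(is_adversarial) != len(flagged):
        raise ValueError("inputs must have the same length")

    # Split the detector's decisions by ground-truth class.
    adv_flags = [f for a, f in zip(is_adversarial, flagged) if a]
    benign_flags = [f for a, f in zip(is_adversarial, flagged) if not a]

    detection_rate = sum(adv_flags) / len(adv_flags) if adv_flags else None
    false_positive_rate = (
        sum(benign_flags) / len(benign_flags) if benign_flags else None
    )
    return detection_rate, false_positive_rate


# Made-up example: 4 adversarial inputs (3 flagged), 5 benign inputs (1 flagged).
truth = [True, True, True, True, False, False, False, False, False]
alarms = [True, True, True, False, True, False, False, False, False]
print(detection_metrics(truth, alarms))
```

The two rates are computed over disjoint sets of inputs, which is why a detector can report both a high detection rate and a non-trivial false positive rate at the same time.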

Abstract

Adversarial sample attacks perturb benign inputs to induce DNN misbehaviors. Recent research has demonstrated the widespread presence and the devastating consequences of such attacks. Existing defense techniques either assume prior knowledge of specific attacks or may not work well on complex models due to their underlying assumptions. We argue that adversarial sample attacks are deeply entangled with interpretability of DNN models: while classification results on benign inputs can be reasoned based on the human perceptible features/attributes, results on adversarial samples can hardly be explained. Therefore, we propose a novel adversarial sample detection technique for face recognition models, based on interpretability. It features a novel bi-directional correspondence inference between attributes and internal neurons to identify neurons critical for individual attributes. The…
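
The abstract is truncated here, so the details of the bi-directional inference are not shown. As a rough illustration only, the sketch below gives one possible reading of a two-way check between an attribute and a neuron: a neuron counts as critical for an attribute if its activation changes when the attribute is replaced, and stays put when only the attribute is kept. The function name, the three activation vectors, and both tolerances are assumptions made for this example, not the authors' algorithm.

```python
def witness_neurons(base, substituted, preserved, change_tol=0.5, keep_tol=0.1):
    """Return indices of neurons that track an attribute in both directions.

    base:        activations for the original image (hypothetical values).
    substituted: activations after the attribute is replaced.
    preserved:   activations when only the attribute is kept.
    change_tol, keep_tol: illustrative thresholds, chosen arbitrarily.
    """
    if not (len(base) == len(substituted) == len(preserved)):
        raise ValueError("activation vectors must have equal length")

    witnesses = []
    for i, (b, s, p) in enumerate(zip(base, substituted, preserved)):
        # Direction 1: changing the attribute should change the neuron.
        changes_with_attribute = abs(b - s) > change_tol
        # Direction 2: keeping only the attribute should keep the neuron.
        stable_with_attribute = abs(b - p) <= keep_tol
        if changes_with_attribute and stable_with_attribute:
            witnesses.append(i)
    return witnesses


# Made-up activations for four neurons.
base = [1.0, 2.0, 0.5, 3.0]
substituted = [0.1, 2.0, 0.5, 1.0]
preserved = [1.0, 2.0, 0.1, 2.0]
print(witness_neurons(base, substituted, preserved))
```

Requiring both conditions is what makes the check two-way: neuron 3 in the example reacts when the attribute is replaced but also drifts when the rest of the image changes, so it is not selected.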

Code & Models

Repositories

AmIAttribute/AmI
pytorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning

Methods: Interpretability