Trace and Detect Adversarial Attacks on CNNs using Feature Response Maps
Mohammadreza Amirian, Friedhelm Schwenker, Thilo Stadelmann

TL;DR
This paper introduces a human-interpretable method for detecting adversarial attacks on CNNs: it tracks adversarial perturbations in feature response maps and detects them automatically using average local spatial entropy. The method is validated against state-of-the-art attacks on ImageNet.
Contribution
It proposes a new detection technique that tracks adversarial perturbations in feature responses without modifying the CNN architecture, with the goal of detecting adversarial examples before an attack succeeds.
Findings
Effective detection of adversarial examples on large-scale CNNs
Method is fully human-interpretable and does not alter network architecture
Validated against state-of-the-art attacks on ImageNet
Abstract
The existence of adversarial attacks on convolutional neural networks (CNN) questions the fitness of such models for serious applications. The attacks manipulate an input image such that misclassification is evoked while still looking normal to a human observer -- they are thus not easily detectable. In a different context, backpropagated activations of CNN hidden layers -- "feature responses" to a given input -- have been helpful to visualize for a human "debugger" what the CNN "looks at" while computing its output. In this work, we propose a novel detection method for adversarial examples to prevent attacks. We do so by tracking adversarial perturbations in feature responses, allowing for automatic detection using average local spatial entropy. The method does not alter the original network architecture and is fully human-interpretable. Experiments confirm the validity of our approach…
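The entropy statistic named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the window size, the intensity quantization, or the decision rule, so `window`, `bins`, and both function names below are assumptions chosen for illustration, not the authors' implementation. The sketch takes a 2D map (for example, a grayscale feature response map), computes the Shannon entropy of the intensity histogram inside each sliding window, and averages those values.

```python
import numpy as np


def local_spatial_entropy(image, window=3, bins=256):
    """Shannon entropy (in bits) of the intensity histogram in each
    window x window patch of a 2D map. `window` and `bins` are
    illustrative assumptions, not values taken from the paper."""
    img = np.asarray(image, dtype=np.float64)
    if img.ndim != 2:
        raise ValueError("expected a 2D (grayscale) map")
    if min(img.shape) < window:
        raise ValueError("map is smaller than the window")

    # Quantize intensities to integer levels in [0, bins - 1].
    lo, hi = img.min(), img.max()
    if hi > lo:
        q = np.floor((img - lo) / (hi - lo) * (bins - 1)).astype(np.int64)
    else:
        q = np.zeros(img.shape, dtype=np.int64)  # constant map

    h, w = q.shape
    out = np.zeros((h - window + 1, w - window + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = q[i:i + window, j:j + window].ravel()
            counts = np.bincount(patch, minlength=bins)
            p = counts[counts > 0] / patch.size
            out[i, j] = -np.sum(p * np.log2(p))
    return out


def average_local_spatial_entropy(image, window=3, bins=256):
    """Mean of the per-window entropies: one scalar per map."""
    return float(local_spatial_entropy(image, window, bins).mean())
```

A detector in the spirit of the abstract would compare this scalar for a suspect input's feature response map against values observed on clean inputs; how that comparison is made is not stated in the text above, so it is left out of the sketch.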
