Neural Fingerprints for Adversarial Attack Detection
Haim Fisher, Moni Shahar, Yehezkel S. Resheff

TL;DR
This paper introduces neural fingerprints and a randomized detection method to improve adversarial attack detection in image classifiers, achieving near-perfect detection rates on ImageNet.
Contribution
The paper proposes a novel neural fingerprinting approach combined with randomization to enhance adversarial attack detection beyond static defenses.
Findings
Near-perfect detection rates on ImageNet
Effective against multiple attack methods
Low false positive rates
Abstract
Deep learning models for image classification have become standard tools in recent years. A well-known vulnerability of these models is their susceptibility to adversarial examples. These are generated by slightly altering an image of a certain class in a way that is imperceptible to humans but causes the model to classify it wrongly as another class. Many algorithms have been proposed to address this problem, falling generally into one of two categories: (i) building robust classifiers and (ii) directly detecting attacked images. Despite the good performance of these detectors, we argue that in a white-box setting, where the attacker knows the configuration and weights of the network and the detector, they can overcome the detector by running many examples on a local copy, and sending only those that were not detected to the actual model. This problem is common in security applications…
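The evasion strategy described in the abstract (filter candidates against a local copy of the detector, then submit only the survivors) can be illustrated with a toy simulation. This is a sketch under strong assumptions and not the authors' method: each "fingerprint" is modeled as an independent coin flip that flags an adversarial candidate with probability `hit_prob`, and an image is rejected when at least `votes_needed` of the `k` active fingerprints fire. The function name and all parameters are hypothetical.

```python
import numpy as np


def simulate(n_candidates=5000, pool_size=200, k=20,
             hit_prob=0.3, votes_needed=3, seed=0):
    """Toy comparison of a static detector and a randomized one.

    Assumption: fingerprints fire independently, which real network
    features generally do not. The numbers are illustrative only.
    """
    rng = np.random.default_rng(seed)

    # fires[i, j] is True when toy fingerprint j flags adversarial candidate i.
    fires = rng.random((n_candidates, pool_size)) < hit_prob

    # Static defense: one fixed subset of fingerprints, known to the attacker.
    static_subset = rng.choice(pool_size, size=k, replace=False)

    # The attacker tests every candidate on a local copy of the static
    # detector and keeps only the candidates that it fails to flag.
    local_votes = fires[:, static_subset].sum(axis=1)
    survivors = fires[local_votes < votes_needed]

    # Deployed static detector: identical to the local copy, so every
    # surviving candidate evades it by construction.
    static_votes = survivors[:, static_subset].sum(axis=1)
    static_evasion = float(np.mean(static_votes < votes_needed))

    # Randomized defense: a fresh subset is drawn from the pool per query,
    # so passing the local check says little about the deployed check.
    evaded = 0
    for row in survivors:
        subset = rng.choice(pool_size, size=k, replace=False)
        evaded += int(row[subset].sum() < votes_needed)
    randomized_evasion = evaded / len(survivors)

    return static_evasion, randomized_evasion, len(survivors)


if __name__ == "__main__":
    static_rate, random_rate, n_survivors = simulate()
    print(f"candidates passing the local check: {n_survivors}")
    print(f"evasion rate vs. static detector:     {static_rate:.2f}")
    print(f"evasion rate vs. randomized detector: {random_rate:.2f}")
```

In this toy setting the filtered candidates always pass the static detector, while against the randomized detector their evasion rate falls back to roughly the base rate of an unfiltered candidate. How closely real fingerprints behave like this depends on how correlated they are, which the simulation does not model.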
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The paper points out that existing adversarial detection methods are not fully white-box. The paper reasons that a single-detector scenario is not sufficient. If random, unknown detectors are used, it will be hard for an attacker to generate an adversarial example that bypasses the detection method.
- The paper offers an interesting insight into the fingerprint distribution in Fig. 1, where the fingerprints of clean images and attacked ones show less overlap, demonstrating the method's ability to separate them.
- The paper explains the proposed ideas using the analogy of detectors, and the terms "detectors" and "fingerprints" seem to be used interchangeably. This causes some confusion. To improve clarity, please provide explicit definitions of "detectors" and "fingerprints" early in the paper, and use these terms consistently throughout. This would help readers better understand the key concepts.
- In Section 2.2, BaRT was not proposed by Tramer et al.; it is from Raff et al. (CVPR 2019).
- The paper stated that the propos
Their method shows strong performance on ImageNet and larger DNN architectures, with high detection rates and low false positives using standard deep learning models. The paper suggests future improvements through likelihood ratio tests and boosting frameworks, while noting the approach could extend beyond image classification to other domains.
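For reference, a generic likelihood ratio test over fingerprint outputs could take the following form. This is a textbook formulation under stated assumptions, not something taken from the paper: the symbols $x_i$, $p_i$, $q_i$, $k$, and $\tau$ are introduced here for illustration, and the fingerprints are assumed to be binary and conditionally independent given the image type.

```latex
% Assumptions (not from the paper):
%   x_i in {0, 1} is the output of fingerprint i, for i = 1, ..., k
%   p_i = P(x_i = 1 | attacked),  q_i = P(x_i = 1 | clean)
%   fingerprints are conditionally independent given the image type
\[
\Lambda(x) \;=\; \sum_{i=1}^{k}
  \left[\, x_i \log\frac{p_i}{q_i}
        \;+\; (1 - x_i)\,\log\frac{1 - p_i}{1 - q_i} \,\right],
\qquad
\text{flag the image as attacked if } \Lambda(x) > \tau .
\]
```

Compared with a simple vote count, this weights each fingerprint by how informative it is, and the threshold $\tau$ sets the trade-off between detection rate and false positives.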
- 022: The motivation is unclear to me: Why is it a common security setup to have complete knowledge? I think that companies try to protect models against model stealing and make it difficult to gain full access to the model. It is OK to assume a complete white-box setting, but the statement that this is common in security does not seem adequate to me.
- Introduction/Related work: The threat of “adversarial examples” is not a new phenomenon. The first attack (FGSM) was published in 2014/2015 by Goodfellow
The idea is novel and I like that the defense includes some element of randomization. In general randomization as the underlying property of a defense seems to be a slightly more solid basis for security (although I will mention why that is not always the case in the weakness section). Having tested on ImageNet is a very important metric for any new adversarial defense.
I will preface my critique of the paper by saying that I don’t think this is an inherently bad work; however, there are a number of critical issues that need to be addressed to strengthen the paper.
Issue 1: The threat model/white-box attack analysis is utterly lacking. The authors claim to test on IFGSM/PGD, but these are VERY old white-box attacks. Why do the authors not test on more state-of-the-art white-box attacks like APGD? https://arxiv.org/pdf/2003.01690
Issue 2: There are no black-box attacks
Taxonomy
Topics: Adversarial Robustness in Machine Learning
