Detecting Adversarial Examples
Furkan Mumcu, Yasin Yilmaz

TL;DR
This paper introduces a universal, lightweight detection method for adversarial examples in deep neural networks by analyzing layer output discrepancies, demonstrating high effectiveness across multiple domains.
Contribution
It presents a novel approach that predicts deep-layer features from early layers to detect adversarial samples, improving robustness against evolving attack techniques.
Findings
High detection accuracy across image, video, and audio domains
Compatible with any DNN architecture
Effective against various adversarial attack methods
Abstract
Deep Neural Networks (DNNs) have been shown to be vulnerable to adversarial examples. While numerous successful adversarial attacks have been proposed, defenses against these attacks remain relatively understudied. Existing defense approaches either focus on negating the effects of perturbations caused by the attacks to restore the DNNs' original predictions or use a secondary model to detect adversarial examples. However, these methods often become ineffective due to the continuous advancements in attack techniques. We propose a novel universal and lightweight method to detect adversarial examples by analyzing the layer outputs of DNNs. Our method trains a lightweight regression model that predicts deeper-layer features from early-layer features, and uses the prediction error to detect adversarial samples. Through theoretical justification and extensive experiments, we demonstrate that…
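The detection idea in the abstract (regress deeper-layer features from early-layer features, then flag inputs whose prediction error is large) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the features are synthetic rather than taken from a real DNN, a closed-form ridge regression stands in for the lightweight regression model, the "adversarial" samples are simulated by shifting the deep features, and the 5% false-positive target and all function names are hypothetical.

```python
import numpy as np

def fit_ridge(X, Y, lam=1e-3):
    """Closed-form ridge regression with a bias column (stand-in regressor)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ Xb + lam * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)

def prediction_error(W, early, deep):
    """Per-sample L2 error between predicted and observed deep features."""
    Xb = np.hstack([early, np.ones((early.shape[0], 1))])
    return np.linalg.norm(Xb @ W - deep, axis=1)

def calibrate_threshold(clean_errors, fpr=0.05):
    """Threshold chosen so roughly `fpr` of clean samples are flagged (assumed policy)."""
    return float(np.quantile(clean_errors, 1.0 - fpr))

def detect(W, early, deep, threshold):
    """Flag a sample as adversarial when its prediction error exceeds the threshold."""
    return prediction_error(W, early, deep) > threshold

# Synthetic stand-ins for layer outputs (hypothetical dimensions: 16 -> 8).
rng = np.random.default_rng(0)
M = rng.normal(size=(16, 8))  # hidden "true" relation between the two layers
early_clean = rng.normal(size=(200, 16))
deep_clean = early_clean @ M + 0.1 * rng.normal(size=(200, 8))

# Simulated adversarial samples: deep features drift away from what the
# early features predict. A real attack would have to produce this drift.
early_adv = rng.normal(size=(50, 16))
deep_adv = early_adv @ M + 0.1 * rng.normal(size=(50, 8)) + 3.0

W = fit_ridge(early_clean, deep_clean)
# For brevity the threshold is calibrated on the training set itself;
# a held-out clean set would be the more careful choice.
threshold = calibrate_threshold(prediction_error(W, early_clean, deep_clean))
flags_clean = detect(W, early_clean, deep_clean, threshold)
flags_adv = detect(W, early_adv, deep_adv, threshold)
```

In this toy setting the shifted samples are easy to separate from the clean ones by construction. It says nothing about adaptive attacks, which the reviews below raise as the main open concern.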
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. **Theoretical Foundation.** A mathematical proof supports the core concept and gives a clear theoretical justification for why the method works.
2. **Lightweight.** Using a relatively small MLP for regression makes LR computationally efficient and suitable for real-time detection.
See questions.
- The proposed method empirically shows meaningful improvement in detection AUC over other baseline defenses across various architectures.
- The motivation is well explained and partially justified through theoretical analysis.

- The proof of Theorem 1 seems to overlook an important assumption in the cited paper (Goodfellow et al., 2014). Goodfellow et al. claim that the perturbation is linearly amplified as it moves through linear models, but there are no theoretical results for nonlinear models.
- The effectiveness of the proposed detector could be further emphasized by investigating its detection performance against attacks that produce adversarial examples with minimal perturbation norms because the tested attack
1. The motivation of the proposed LR is clearly stated.
2. According to the experimental results, LR detects adversarial examples with high efficiency.
3. Experiments in other domains are included to demonstrate the universality of LR.

1. The detection baselines are not strong enough. Some strong baselines, like [1][2][3], are not included, which makes the experimental results less convincing.
2. No adaptive attack against LR is evaluated, although adaptive attacks are important for assessing detection performance.
3. According to Section C in the Appendix, the subset of layer vectors needs to be selected for each model, and the dataset may sometimes influence the choice of layers, reducing the practicality of LR.

[1] Tian,
Defending against adversarial examples is an important and interesting challenge.
Unfortunately, the paper appears unfamiliar with the (vast) literature on this topic and does not present convincing evidence that the method will be robust to adaptive attacks. I would particularly recommend the authors begin by reviewing Carlini & Wagner, "Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods", and Tramer, "Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them". The latter paper, in particular, shows that the results claimed here would imply
I tried, but it's difficult to write down any points that deserve to be called Strengths for an ICLR-submitted paper.
The weaknesses of this paper include:
- **The proof of Theorem 1 is wrong.** The proof rests on the statement that "Finally, the perturbation aligned with DNN weights is amplified as it sequentially moves through the DNN layers (Goodfellow et al. 2014)", which is an *empirical observation*, not a theoretical conclusion. I was shocked that the authors treat an empirical observation as a formal theorem, and it is easy to construct a counter-example DNN that violates Theorem 1.
- **Non-adaptive evaluations.** In
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Anomaly Detection Techniques and Applications
