Detecting Adversarial Examples

Furkan Mumcu; Yasin Yilmaz

arXiv:2410.17442·cs.LG·June 17, 2025


Open Access · 5 Reviews

TL;DR

This paper introduces a universal, lightweight detection method for adversarial examples in deep neural networks by analyzing layer output discrepancies, demonstrating high effectiveness across multiple domains.

Contribution

It presents a novel approach that predicts deep-layer features from early layers to detect adversarial samples, improving robustness against evolving attack techniques.

Findings

1. High detection accuracy across image, video, and audio domains
2. Compatible with any DNN architecture
3. Effective against various adversarial attack methods

Abstract

Deep Neural Networks (DNNs) have been shown to be vulnerable to adversarial examples. While numerous successful adversarial attacks have been proposed, defenses against these attacks remain relatively understudied. Existing defense approaches either focus on negating the effects of perturbations caused by the attacks to restore the DNNs' original predictions or use a secondary model to detect adversarial examples. However, these methods often become ineffective due to the continuous advancements in attack techniques. We propose a novel universal and lightweight method to detect adversarial examples by analyzing the layer outputs of DNNs. Our method trains a lightweight regression model that predicts deeper-layer features from early-layer features, and uses the prediction error to detect adversarial samples. Through theoretical justification and extensive experiments, we demonstrate that…
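The detection idea in the abstract (fit a regressor on benign data that predicts deeper-layer features from early-layer features, then flag inputs whose prediction error is large) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the toy two-layer map `A`, the linear SGD regressor (a stand-in for the paper's lightweight regression model), the synthetic "adversarial" shift, and the 95th-percentile threshold.

```python
import math
import random

random.seed(0)

# Hypothetical toy "network": a fixed map from 3 early-layer features to
# 2 deep-layer features. It stands in for the layers of a real DNN.
A = [[1.0, -0.5], [0.5, 1.0], [-1.0, 0.5]]

def deep_features(early):
    return [math.tanh(sum(early[i] * A[i][j] for i in range(3))) for j in range(2)]

def predict(W, b, early):
    return [b[j] + sum(early[i] * W[i][j] for i in range(len(early)))
            for j in range(len(b))]

def fit_regressor(X, Y, lr=0.5, epochs=100):
    """Fit Y ~ X W + b with plain SGD (a linear stand-in, not the paper's model)."""
    W = [[0.0] * len(Y[0]) for _ in range(len(X[0]))]
    b = [0.0] * len(Y[0])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            pred = predict(W, b, x)
            for j in range(len(b)):
                err = pred[j] - y[j]
                b[j] -= lr * err
                for i in range(len(x)):
                    W[i][j] -= lr * err * x[i]
    return W, b

def score(W, b, early):
    """Detection score: mean squared error between predicted and actual deep features."""
    actual = deep_features(early)
    pred = predict(W, b, early)
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)

def benign_sample():
    return [random.uniform(-0.2, 0.2) for _ in range(3)]

def adversarial_sample():
    # Assumption: the attack shifts early features along sign(A[:, 0]),
    # i.e. aligned with the weights feeding the first deep feature.
    shift = [0.8, 0.8, -0.8]
    return [e + s for e, s in zip(benign_sample(), shift)]

# Train the regressor on benign data only.
train = [benign_sample() for _ in range(200)]
W, b = fit_regressor(train, [deep_features(x) for x in train])

# Threshold: 95th percentile of benign training scores (an illustrative choice).
train_scores = sorted(score(W, b, x) for x in train)
threshold = train_scores[int(0.95 * len(train_scores))]

benign_test = [benign_sample() for _ in range(200)]
adv_test = [adversarial_sample() for _ in range(200)]
false_positive_rate = sum(score(W, b, x) > threshold for x in benign_test) / 200
detection_rate = sum(score(W, b, x) > threshold for x in adv_test) / 200
print(f"threshold={threshold:.2e} FPR={false_positive_rate:.2f} TPR={detection_rate:.2f}")
```

In this toy setting the regressor fits well only in the region covered by benign features, so shifted inputs produce a much larger prediction error. The sketch does not model an adaptive attacker, which is the main concern raised in the reviews.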

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 3 · Confidence 4

Strengths

1. **Theoretical Foundation.** A mathematical proof supports the core concept and gives a clear theoretical justification for why the method works.
2. **Lightweight.** Using a relatively small MLP for regression makes LR computationally efficient and suitable for real-time detection.

Weaknesses

See questions.

Reviewer 02 · Rating 3 · Confidence 4

Strengths

- The proposed method empirically shows meaningful improvement in detection AUC compared to other baseline defenses across various architectures.
- The motivation is well explained and partially justified through theoretical analysis.

Weaknesses

- The proof of Theorem 1 seems to miss an important assumption in the cited paper (Goodfellow et al., 2014). Goodfellow et al. claim that the perturbation is linearly amplified as it moves through linear models, but there are no theoretical results for nonlinear models.
- The effectiveness of the proposed detector could be further emphasized by investigating its detection performance against attacks that produce adversarial examples with minimal perturbation norms because the tested attack…

Reviewer 03 · Rating 5 · Confidence 5

Strengths

1. The motivation of the proposed LR is clearly stated.
2. According to the experimental results, LR detects adversarial examples with high efficiency.
3. Experiments in other domains are implemented to prove the universality of LR.

Weaknesses

1. The detection baselines are not strong enough. Some strong baselines, like [1][2][3], are not included, which makes the experimental results less convincing.
2. No adaptive attack against LR is evaluated, although adaptive attacks are important for evaluating detection performance.
3. According to Section C in the Appendix, the subset of layer vectors needs to be selected for each model, and the dataset may sometimes influence the choice of layers, reducing the practicality of LR.

[1] Tian,

Reviewer 04 · Rating 3 · Confidence 5

Strengths

Defending against adversarial examples is an important and interesting challenge.

Weaknesses

Unfortunately, the paper appears unfamiliar with the (vast) literature on this topic and does not present convincing evidence that the method will be robust to adaptive attacks. I would particularly recommend the authors begin by reviewing Carlini & Wagner "Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods", and Tramer "Detecting Adversarial Examples Is (Nearly) As Hard As Classifying Them". The latter paper, in particular, shows that the results claimed here would imply

Reviewer 05 · Rating 1 · Confidence 5

Strengths

I tried, but it's difficult to write down any points that deserve to be called Strengths for an ICLR-submitted paper.

Weaknesses

The weaknesses of this paper include:
- **The proof of Theorem 1 is wrong.** The proof relies on the statement that "Finally, the perturbation aligned with DNN weights is amplified as it sequentially moves through the DNN layers (Goodfellow et al. 2014)", which is an *empirical observation*, not a theoretical conclusion. I was shocked that the authors treat an empirical observation as a formal Theorem, and it's easy to construct a counter-example DNN that violates Theorem 1.
- **Non-adaptive evaluations.** In

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Anomaly Detection Techniques and Applications