A Few Large Shifts: Layer-Inconsistency Based Minimal Overhead Adversarial Example Detection

Sanggeon Yun; Ryozo Masukawa; Hyunwoo Oh; Nathaniel D. Bastian; Mohsen Imani

arXiv:2505.12586·cs.LG·October 3, 2025

TL;DR

This paper presents a lightweight, layer-inconsistency based detection method for adversarial examples in deep neural networks, achieving state-of-the-art results with minimal overhead by exploiting localized violations of layer-wise Lipschitz continuity.

Contribution

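The paper introduces a detection framework that leverages layer-wise inconsistencies inside the target model itself, requires only benign data for calibration, and is grounded in the A Few Large Shifts Assumption. It proposes two complementary strategies: Recovery Testing (RT) and Logit-layer Testing (LT).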
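The A Few Large Shifts Assumption can be illustrated with a toy computation. The sketch below is not the paper's RT or LT procedure. It assumes access to both a clean input and its perturbed copy, which a deployed detector would not have. The helper names (`dense_relu`, `shift_ratios`, `large_shift_layers`), the three hand-written layers, and the bound of 2.0 are all hypothetical and chosen only for illustration. Each ratio is an empirical local Lipschitz estimate for one layer, and a layer is flagged when its ratio exceeds the bound.

```python
import math

def l2(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dense_relu(weights):
    """Toy layer z -> relu(W z); a hypothetical stand-in for a real network layer."""
    def layer(z):
        return [max(0.0, sum(w * a for w, a in zip(row, z))) for row in weights]
    return layer

def layer_traces(layers, x):
    """Return the activations [z_0, z_1, ..., z_L], where z_0 is the input."""
    trace = [list(x)]
    for f in layers:
        trace.append(f(trace[-1]))
    return trace

def shift_ratios(layers, x, x_pert, eps=1e-12):
    """Per-layer ratio ||z_k' - z_k|| / ||z_{k-1}' - z_{k-1}|| for k = 1..L."""
    clean = layer_traces(layers, x)
    pert = layer_traces(layers, x_pert)
    ratios = []
    for k in range(1, len(clean)):
        d_in = l2(clean[k - 1], pert[k - 1])
        d_out = l2(clean[k], pert[k])
        ratios.append(d_out / max(d_in, eps))  # eps guards against division by zero
    return ratios

def large_shift_layers(ratios, bound):
    """1-indexed layers whose shift ratio exceeds the assumed bound."""
    return [k for k, r in enumerate(ratios, start=1) if r > bound]

# Three toy layers: identity-like, amplifying in the first coordinate, contractive.
layers = [
    dense_relu([[1.0, 0.0], [0.0, 1.0]]),
    dense_relu([[5.0, 0.0], [0.0, 1.0]]),
    dense_relu([[0.5, 0.0], [0.0, 0.5]]),
]
x = [1.0, 1.0]
x_pert = [1.1, 1.0]  # small perturbation of the first coordinate

ratios = shift_ratios(layers, x, x_pert)
flagged = large_shift_layers(ratios, bound=2.0)
print(ratios, flagged)
```

In this toy setup only the second layer amplifies the perturbation beyond the bound, which is the kind of localized, disproportionate shift the assumption describes.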

Findings

01. Achieves state-of-the-art detection accuracy on CIFAR-10, CIFAR-100, and ImageNet.

02. Operates with negligible computational overhead.

03. Provides a formal lower-bound guarantee for detection threshold selection.

Abstract

Deep neural networks (DNNs) are highly susceptible to adversarial examples--subtle, imperceptible perturbations that can lead to incorrect predictions. While detection-based defenses offer a practical alternative to adversarial training, many existing methods depend on external models, complex architectures, or adversarial data, limiting their efficiency and generalizability. We introduce a lightweight, plug-in detection framework that leverages internal layer-wise inconsistencies within the target model itself, requiring only benign data for calibration. Our approach is grounded in the A Few Large Shifts Assumption, which posits that adversarial perturbations induce large, localized violations of layer-wise Lipschitz continuity in a small subset of layers. Building on this, we propose two complementary strategies--Recovery Testing (RT) and Logit-layer Testing (LT)--to empirically…
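Calibration from benign data alone can be sketched with a generic empirical-quantile rule. This is an assumption-laden illustration, not the paper's threshold selection or its formal lower-bound guarantee. It assumes each input has already been reduced to a scalar inconsistency score where higher means more suspicious. The names `calibrate_threshold`, `is_adversarial`, and `target_fpr` are hypothetical.

```python
import math

def calibrate_threshold(benign_scores, target_fpr=0.05):
    """Choose tau so that at most target_fpr of the benign calibration scores exceed it."""
    if not benign_scores:
        raise ValueError("benign_scores must be non-empty")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must lie in [0, 1)")
    ordered = sorted(benign_scores)
    n = len(ordered)
    # Number of benign scores allowed to sit strictly above the threshold.
    allowed = min(math.floor(target_fpr * n), n - 1)
    return ordered[n - allowed - 1]

def is_adversarial(score, tau):
    """Flag an input when its score is strictly above the calibrated threshold."""
    return score > tau

# Hypothetical benign calibration scores: 0.1, 0.2, ..., 2.0.
benign_scores = [i / 10 for i in range(1, 21)]
tau = calibrate_threshold(benign_scores, target_fpr=0.25)
false_positives = sum(is_adversarial(s, tau) for s in benign_scores)
print(tau, false_positives / len(benign_scores))
```

The rule needs no adversarial examples: the false-positive rate on the calibration set is bounded by construction, while the detection rate depends entirely on how well the score separates adversarial from benign inputs.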

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1- Clear, self-contained detection paradigm. The method uses only the target network’s layer traces and benign calibration: no adversarial data, no external SSL encoder, no kNN graphs. The diagram on p. 3 (Fig. 2) cleanly shows RT/LT and their fusion.
2- Well-articulated mechanism + measurable proxies. The FLS assumption is made operational via layer-wise reconstruction errors (RT) and logit-vs-feature sensitivity ratios (LT), tied to local Lipschitz language and formalized in App. B with theore

Weaknesses

1- Threat-model precision and attack breadth. While Orthogonal-PGD, end-to-end PGD on RLT, AutoAttack, and SimBA are included, the main tables emphasize ℓ∞ (and one ℓ2 ablation). Missing: EOT against any stochasticity in the learned augmentations, multi-restart ablations for PGD (step-size/steps/restarts grids), and transfer from surrogate models in a detection sense.
2- Assumptions around RT "invertibility." RT presumes the existence of approximate inverses from $z_L$ to earlier $z_k$ (Assump

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The RLT framework it proposes is based on the A Few Large Shifts Assumption and includes a clear comparison with existing methods.
2. The paper is clearly written, the figures and tables are easy to understand, and the overall flow of the text is good.
3. The paper demonstrates advantages in both accuracy and efficiency compared to existing methods, and its effectiveness is validated through experiments.

Weaknesses

1. The paper could benefit from further quantification of its core assumption. Although the phenomenon is empirically demonstrated through RT and LT, the work lacks a stronger theoretical or mathematical proof to explain why perturbations cause this local and disproportionate damage in "a small subset of layers" rather than being uniformly distributed across all layers.
2. The applicability of the proposed framework, particularly the Recovery Testing (RT) module, is limited by its training dat

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- The paper is very clear and well-written.
- The authors justify each choice behind the design of their defense.
- The research problem is still open, and the provided contribution is relevant in this sense.
- The experimental evaluation shows a clear improvement with respect to the considered competing approaches.

Weaknesses

- The provided robust accuracy under worst-case adaptive attackers reveals that the defense, in such a scenario (which is very relevant when considering security-related applications), is quite weak.
- (minor) Additionally, I'm a bit skeptical about adversarial example detectors, as they often have been broken by well-crafted attacks that are able to overcome the defense mechanism. The authors

Reviewer 04 · Rating 2 · Confidence 5

Strengths

Detecting adversarial inputs remains a challenging problem.

Weaknesses

The paper relegates to the appendix important issues such as the setting of the hyperparameters tau (line 156). Also, where do the "Lipschitz" parameters D come from in Assumption 1? (For that matter, clearly define an adversarial input at this point.) The fonts in Fig 1 and the tables are too small. Within the text, some of the mathematical expressions are hard to discern, e.g., softmax on line 164 (which I don't think is the standard definition). L_RT involves a "benign" dataset x_n of N

Code & Models

Repositories

c0510gy/AFLS-AED · jax · Official

Taxonomy

Topics: Advanced Malware Detection Techniques · Network Security and Intrusion Detection · Anomaly Detection Techniques and Applications