Concept-based Adversarial Attack: a Probabilistic Perspective

Andi Zhang; Xuan Ding; Steven McDonagh; Samuel Kaski

arXiv:2507.02965·cs.CV·March 2, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a probabilistic, concept-based adversarial attack framework that generates diverse, concept-preserving adversarial examples by operating on concept distributions, improving attack diversity and efficiency.

Contribution

It extends adversarial attacks to operate on concept distributions, enabling more diverse and concept-preserving adversarial examples with a principled probabilistic approach.

Findings

01 More diverse adversarial examples generated

02 Higher attack efficiency achieved

03 Effective preservation of original concept

Abstract

We propose a concept-based adversarial attack framework that extends beyond single-image perturbations by adopting a probabilistic perspective. Rather than modifying a single image, our method operates on an entire concept, represented by a distribution, to generate diverse adversarial examples. Preserving the concept is essential, as it ensures that the resulting adversarial images remain identifiable as instances of the original underlying category or identity. By sampling from this concept-based adversarial distribution, we generate images that maintain the original concept but vary in pose, viewpoint, or background, thereby misleading the classifier. Mathematically, this framework remains consistent with traditional adversarial attacks in a principled manner. Our theoretical and empirical results demonstrate that concept-based adversarial attacks yield more diverse adversarial…
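The probabilistic view described in the abstract can be sketched as a product of two distributions. This is a hedged reconstruction, not the paper's exact statement: the symbols $p_{vic}$ and $p_{dis}$ appear in Reviewer 02's comments, while $p_{adv}$, $\mathcal{C}_{ori}$, $x_{ori}$, $y_{tar}$, the distance $d$, and the scale $c$ are assumed names introduced here for illustration.

```latex
% Assumed form: the adversarial distribution is proportional to the product of
% a victim term (how strongly the classifier assigns the target label y_tar)
% and a distance term (how well the sample stays within the original concept).
\[
  p_{adv}(x_{adv} \mid \mathcal{C}_{ori}, y_{tar})
  \;\propto\;
  p_{vic}(x_{adv} \mid y_{tar}) \; p_{dis}(x_{adv} \mid \mathcal{C}_{ori})
\]

% Assumed single-image special case: the concept collapses to one image x_ori,
% and the distance term penalizes a geometric distance d with scale c > 0.
\[
  \mathcal{C}_{ori} = \{x_{ori}\}, \qquad
  p_{dis}(x_{adv} \mid x_{ori}) \;\propto\; \exp\bigl(-c \, d(x_{adv}, x_{ori})\bigr)
\]
```

Under these assumptions, replacing the single-image distance term with a distribution fitted to a whole concept (for example, a fine-tuned generative model, as the reviews mention) is what lets samples vary in pose, viewpoint, or background while the first factor still pushes them toward the target label.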

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

The main contributions (and strengths) of the paper are:
* Novel formulation: First work to define adversarial distance at the concept level rather than per-image, enabling more semantically meaningful attacks.
* Strong empirical results: Achieves state-of-the-art targeted attack success rates (e.g., 97.82% white-box on ResNet-50) while better preserving concept identity (validated via user studies and CLIP scores).
* Theoretical justification: Provides analysis showing that expanding the dis

Weaknesses

* Computational cost: The proposed approach requires fine-tuning generative models per concept, which is time-consuming (≈8 hours/concept) and limits scalability.
* Limited transferability: While the experimental results show strong performance on white-box attacks, black-box transfer success remains low (though better than baselines), especially under strict top-1 metrics.
* Concept definition ambiguity: The proposed approach relies on user-provided image sets or fine-tuned models to define a “concept,”

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The presented framework is a clean and well-motivated generalization of the probabilistic framework presented in Zhang et al. The idea of moving away from an image-centric distance distribution to a concept-prior through the use of finetuned diffusion models is inspired.
2. Empirical performance of the given approach is encouraging, and the results support the authors' claims of better, and more semantically meaningful adversarial examples as compared to methods like DiffAttack.
3. Implemen

Weaknesses

1. The theoretical contributions are mostly incremental, with both Thm. 1 and Thm. 2 being straightforward algebra. While supportive of the presented conceptual framework, they do not really provide any additional insight on how the approach can be further optimized or adapted to specific PGMs like diffusion models.
2. The transferability results are extremely low. This suggests very low overlap between $p_{vic}$ and $p_{dis}$, which is a bit counterintuitive given the strong performance of these classi

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The problem is well-motivated. The paper goes beyond traditional single-image or class-level attacks and instead enables identity-level, concept-aware adversarial generation that produces realistic and semantically consistent examples. Furthermore, I believe this approach could be valuable beyond adversarial attacks. It may help future work probe model hallucination and understand the semantic priors that models rely on.
2. The method is built on a clear probabilistic formulation rather than

Weaknesses

1. **Missing compute / FLOP parity.** The appendix briefly reports compute but does not provide a clear, quantitative comparison of **FLOPs / GPU-hours** between the proposed pipeline and the baselines. Please report wall-clock GPU-hours and/or FLOP counts per concept (training + sampling) for the proposed method and for each baseline. This will help readers judge whether performance gains are due to algorithmic novelty or to much greater compute and data budgets.
2. **Experimental

Taxonomy

Topics: Adversarial Robustness in Machine Learning