Concept-based Adversarial Attack: a Probabilistic Perspective
Andi Zhang, Xuan Ding, Steven McDonagh, Samuel Kaski

TL;DR
This paper introduces a probabilistic, concept-based adversarial attack framework that generates diverse, concept-preserving adversarial examples by operating on concept distributions, improving attack diversity and efficiency.
Contribution
It extends adversarial attacks to operate on concept distributions, enabling more diverse and concept-preserving adversarial examples with a principled probabilistic approach.
Findings
More diverse adversarial examples generated
Higher attack efficiency achieved
Effective preservation of original concept
Abstract
We propose a concept-based adversarial attack framework that extends beyond single-image perturbations by adopting a probabilistic perspective. Rather than modifying a single image, our method operates on an entire concept - represented by a distribution - to generate diverse adversarial examples. Preserving the concept is essential, as it ensures that the resulting adversarial images remain identifiable as instances of the original underlying category or identity. By sampling from this concept-based adversarial distribution, we generate images that maintain the original concept but vary in pose, viewpoint, or background, thereby misleading the classifier. Mathematically, this framework remains consistent with traditional adversarial attacks in a principled manner. Our theoretical and empirical results demonstrate that concept-based adversarial attacks yield more diverse adversarial…
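The intuition for why a concept distribution can help may be sketched with a toy 1-D example. It assumes the adversarial distribution is proportional to the product of a victim term and a distance term (the $p_{vic}$ and $p_{dis}$ notation used in the reviews); that factorization is an assumption here, not something this page states. Every function, constant, and density below is hypothetical and chosen only for illustration: a single image is modeled as one narrow Gaussian, and a concept as a mixture of Gaussians over several views of the same subject.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def victim_term(x, boundary=2.0, sharpness=4.0):
    """Toy stand-in for p_vic: close to 1 where a 1-D 'classifier' outputs the target label."""
    return 1.0 / (1.0 + math.exp(-sharpness * (x - boundary)))

def overlap_mass(distance_pdf, lo=-6.0, hi=10.0, steps=1600):
    """Riemann-sum estimate of the integral of victim_term(x) * distance_pdf(x) on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        total += victim_term(x) * distance_pdf(x)
    return total * dx

def single_image_pdf(x):
    """Assumed per-image distance distribution: one narrow Gaussian around one image."""
    return gaussian_pdf(x, 0.0, 0.5)

# Assumed concept distribution: equal-weight mixture over several "views" of one concept.
CONCEPT_CENTERS = [-1.0, 0.0, 1.0, 1.5]

def concept_pdf(x):
    return sum(gaussian_pdf(x, c, 0.5) for c in CONCEPT_CENTERS) / len(CONCEPT_CENTERS)

single_mass = overlap_mass(single_image_pdf)
concept_mass = overlap_mass(concept_pdf)
print(f"overlap with single-image distribution: {single_mass:.5f}")
print(f"overlap with concept distribution:      {concept_mass:.5f}")
```

In this toy setting the broader concept distribution places more probability mass in the region the classifier assigns to the target label, which is one way to read the claim that operating on a concept makes adversarial samples easier to find. It says nothing about the paper's actual models or numbers.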
Peer Reviews
Decision: ICLR 2026 Poster
The main contributions (and strengths) of the paper are:
* Novel formulation: First work to define adversarial distance at the concept level rather than per-image, enabling more semantically meaningful attacks.
* Strong empirical results: Achieves state-of-the-art targeted attack success rates (e.g., 97.82% white-box on ResNet-50) while better preserving concept identity (validated via user studies and CLIP scores).
* Theoretical justification: Provides analysis showing that expanding the dis
* Computational cost: The proposed approach requires fine-tuning generative models per concept, which is time-consuming (≈8 hours/concept) and limits scalability.
* Limited transferability: While the experimental results show strong performance on white-box attacks, black-box transfer success remains low (though better than baselines), especially under strict top-1 metrics.
* Concept definition ambiguity: The method relies on user-provided image sets or fine-tuned models to define a “concept,”
1. The presented framework is a clean and well-motivated generalization of the probabilistic framework presented in Zhang et al. The idea of moving away from an image-centric distance distribution to a concept prior through the use of fine-tuned diffusion models is inspired.
2. Empirical performance of the given approach is encouraging, and the results support the authors' claims of better and more semantically meaningful adversarial examples compared to methods like DiffAttack.
3. Implemen
1. The theoretical contributions are mostly incremental, with both Thm. 1 and 2 being straightforward algebra. While supportive of the presented conceptual framework, they do not really provide any additional insight on how the approach can be further optimized or adapted to specific PGMs like diffusion models.
2. The transferability results are extremely low. This suggests very low overlap between $p_{vic}$ and $p_{dis}$, which is a bit counterintuitive given the strong performance of these classi
1. The problem is well-motivated. The paper goes beyond traditional single-image or class-level attacks and instead enables identity-level, concept-aware adversarial generation that produces realistic and semantically consistent examples. Furthermore, I believe this approach could be valuable beyond adversarial attacks. It may help future work probe model hallucination and understand the semantic priors that models rely on.
2. The method is built on a clear probabilistic formulation rather than
1. **Missing compute / FLOP parity.** The appendix briefly reports compute but does not provide a clear, quantitative comparison of **FLOPs / GPU-hours** between the proposed pipeline and the baselines. Please report wall-clock GPU-hours and/or FLOP counts per concept (training + sampling) for the proposed method and for each baseline. This will help readers judge whether performance gains are due to algorithmic novelty or to much greater compute and data budgets.
2. **Experimental
Taxonomy
Topics: Adversarial Robustness in Machine Learning
