eXIAA: eXplainable Injections for Adversarial Attack
Leonardo Pesce, Jiawen Wei, Gianmarco Mengaldo

TL;DR
This paper introduces a new black-box, model-agnostic adversarial attack on explainability methods in image classification, capable of significantly altering explanations without changing model predictions, exposing vulnerabilities in current XAI techniques.
Contribution
We propose a novel single-step attack that modifies explanations in post-hoc XAI methods without model access, highlighting critical security flaws.
Findings
The attack can dramatically change explanations while preserving model predictions.
It requires only access to model predictions and explanations, not model weights.
The method is effective on popular models like ResNet-18 and ViT-B16 on ImageNet.
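The two conditions in these findings (prediction preserved, explanation changed) can be checked with a few lines of NumPy. The following is a minimal sketch, not the paper's evaluation protocol: the top-k overlap measure, the function names `top_k_overlap` and `attack_succeeded`, and the `max_overlap` threshold are all illustrative assumptions.

```python
import numpy as np

def top_k_overlap(expl_a, expl_b, k=100):
    """Fraction of the k highest-|attribution| positions shared by two maps.

    Assumption: explanations are arrays of per-pixel attributions with the
    same shape. This overlap measure is an illustrative choice only.
    """
    idx_a = np.argsort(np.abs(expl_a).ravel())[-k:]
    idx_b = np.argsort(np.abs(expl_b).ravel())[-k:]
    return len(np.intersect1d(idx_a, idx_b)) / k

def attack_succeeded(probs_clean, probs_adv, expl_clean, expl_adv,
                     k=100, max_overlap=0.5):
    """True if the predicted class is unchanged and the explanation moved.

    `max_overlap` is a hypothetical threshold, not a value from the paper.
    """
    same_class = int(np.argmax(probs_clean)) == int(np.argmax(probs_adv))
    changed = top_k_overlap(expl_clean, expl_adv, k) <= max_overlap
    return bool(same_class and changed)
```

Only predictions and explanations enter the check, matching the black-box access level stated above.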
Abstract
Post-hoc explainability methods are a subset of Machine Learning (ML) that aim to provide a reason for why a model behaves in a certain way. In this paper, we show a new black-box model-agnostic adversarial attack for post-hoc explainable Artificial Intelligence (XAI), particularly in the image domain. The goal of the attack is to modify the original explanations while remaining undetected by the human eye and maintaining the same predicted class. In contrast to previous methods, we do not require any access to the model or its weights, but only to the model's computed predictions and explanations. Additionally, the attack is accomplished in a single step while significantly changing the provided explanations, as demonstrated by empirical evaluation. The low requirements of our method expose a critical vulnerability in current explainability methods, raising concerns about their reliability in…
Peer Reviews
Decision · Submitted to ICLR 2026
The paper addresses an important and underexplored vulnerability in XAI systems. The fragility of explanations in safety-critical applications like medicine is a genuine concern, and demonstrating this vulnerability is valuable. The practical threat model is a key strength—requiring only access to predictions and explanations (no model weights or gradients) makes this attack significantly more realistic than prior work requiring white-box access or iterative optimization. The single-step nature
The technical novelty is limited. The three-phase pipeline (select runner-up image, extract top features, alpha-blend) is relatively straightforward and lacks theoretical depth. The method essentially performs a weighted average of pixels guided by saliency; this is more of an engineering contribution than a fundamental algorithmic advance. The paper would benefit from deeper analysis of why this simple approach works so effectively. The threat model requires clarification. The assumption that a
This paper presents a new attack method for deep neural networks. The method is simple but effective. Experimental results show that the explanation methods such as saliency maps, integrated gradients, and DeepLIFT SHAP are affected, which can be a potential risk of AI.
The proposed approach is very simple and seems straightforward. In other words, the proposed method seems ad hoc and empirical, with no theoretical backing. Therefore, technical depth is rather weak. The authors may want to include mathematical analysis on why such an attack is possible. Besides, Figure 5 shows that the explanation change induced by the images of the runner-up class (full lines) is always as good as or better than picking an attack image from any other class, which is also empir
- The method is simple
- You claim that your method operates in a more realistic setting where you do not have access to model internals (e.g. gradients, weights, etc.). But you actually require explanations of the model, so in fact you _do_ need access to the model internals. I find this very weak, and this realization somewhat defeats the novelty of the paper in my opinion. One way you could potentially get around this is by using a surrogate model - e.g. say you train a VGG16 on ImageNet, but instead us
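The three-phase pipeline the reviewers describe (select runner-up image, extract top features, alpha-blend) can be sketched roughly as follows. This is a hedged illustration built only from that one-line description, not the authors' implementation: `predict`, `explain`, and `pool` are hypothetical black-box callables and data supplied by the caller, and `alpha` and `fraction` are placeholder values.

```python
import numpy as np

def runner_up_class(probs):
    """Index of the second-highest predicted class."""
    return int(np.argsort(probs)[-2])

def top_feature_mask(saliency, fraction=0.05):
    """Boolean mask over the top `fraction` of pixels by |saliency|.

    Ties at the threshold may select slightly more pixels than requested.
    """
    k = max(1, int(fraction * saliency.size))
    magnitude = np.abs(saliency)
    threshold = np.partition(magnitude.ravel(), -k)[-k]
    return magnitude >= threshold

def blend_attack(image, predict, explain, pool, alpha=0.2, fraction=0.05):
    """Single-step blend of a runner-up image's salient pixels into `image`.

    Assumptions (not from the paper): `image` is HxWxC, `predict(image)`
    returns class probabilities, `explain(image, cls)` returns an HxW
    attribution map, and `pool[cls]` is an HxWxC image of class `cls`.
    """
    target = runner_up_class(predict(image))        # phase 1: runner-up class
    donor = pool[target]
    mask = top_feature_mask(explain(donor, target), fraction)  # phase 2
    blended = (1.0 - alpha) * image + alpha * donor            # phase 3
    return np.where(mask[..., None], blended, image)
```

The sketch queries the model only through `predict` and `explain`, which is the access level the reviewers debate. A caller would still need to confirm that `predict` returns the same top class for the output, since nothing in this single step guarantees it.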
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Advanced Neural Network Applications
