Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations
Narges Babadi, Hadis Karimipour

TL;DR
This paper shows that explanation heatmaps in vision-language models, particularly CLIP-based ones, can be adversarially manipulated without changing the model's predictions, so a correct prediction does not guarantee a faithful explanation.
Contribution
The authors introduce X-Shift, a grey-box attack that perturbs patch-level visual representations in CLIP-based models to redirect explanation heatmaps without altering the predicted output, exposing a vulnerability in the explanation mechanism itself.
Findings
Explanation heatmaps can be systematically manipulated while preserving predictions.
X-Shift redirects explanations toward semantically irrelevant regions across multiple architectures.
Standard adversarial attacks do not produce similar explanation shifts, suggesting that this vulnerability is distinct from ordinary prediction-level adversarial fragility.
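The findings above rest on one idea: a prediction and its explanation can be computed from different views of the same patch features, so one can move without the other. The toy sketch below illustrates that idea only. It is not X-Shift and not CLIP. The mean-pooled classifier, the dot-product heatmap, and the names `predict`, `heatmap`, and `shift_explanation` are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np


def predict(patches, text_emb):
    """Toy classifier (assumption): mean-pool patches, score against text embeddings."""
    image_emb = patches.mean(axis=0)
    logits = text_emb @ image_emb
    return int(np.argmax(logits)), logits


def heatmap(patches, text_emb, cls):
    """Toy explanation (assumption): similarity of each patch to the class text embedding."""
    return patches @ text_emb[cls]


def shift_explanation(patches, text_emb, cls, target_patch, strength=10.0):
    """Hypothetical perturbation, not the paper's attack.

    Pushes the target patch toward the class text direction and pulls every
    other patch slightly away. The per-patch weights sum to zero, so the
    mean-pooled image embedding (and hence the logits) stays the same.
    """
    n = patches.shape[0]
    direction = text_emb[cls] / np.linalg.norm(text_emb[cls])
    weights = np.full(n, -1.0 / (n - 1))
    weights[target_patch] = 1.0
    return patches + strength * np.outer(weights, direction)


# Random stand-in data: 16 patch embeddings and 3 class text embeddings, dimension 8.
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))
text_emb = rng.normal(size=(3, 8))

cls, logits = predict(patches, text_emb)
before = heatmap(patches, text_emb, cls)

# Redirect the explanation to the patch that was originally the least relevant.
target = int(np.argmin(before))
attacked = shift_explanation(patches, text_emb, cls, target)

cls_after, logits_after = predict(attacked, text_emb)
after = heatmap(attacked, text_emb, cls_after)

print("prediction unchanged:", cls_after == cls)
print("heatmap peak before / after:", int(np.argmax(before)), int(np.argmax(after)))
```

In this toy setting the logits are preserved exactly (up to floating-point error) because pooling is linear. A real CLIP encoder is nonlinear, so an actual attack would presumably need an optimization that balances a heatmap objective against a prediction-preservation constraint; the listing above does not describe how X-Shift does this.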
Abstract
Explanation mechanisms are increasingly used to support transparency and trust in vision-language models (VLMs), particularly in settings where model decisions require human oversight. However, the robustness of these explanations remains insufficiently understood. In this work, we investigate whether explanation heatmaps in VLMs, particularly CLIP-based models, faithfully reflect model reasoning under adversarial conditions. We show that explanation maps can be systematically manipulated while preserving the model's original prediction, revealing a disconnect between predictive behavior and explanation faithfulness. To study this vulnerability, we introduce X-Shift, a novel grey-box attack that perturbs patch-level visual representations to redirect explanation heatmaps toward semantically irrelevant regions without altering the predicted output. Unlike conventional adversarial attacks…
