A Vulnerability of Attribution Methods Using Pre-Softmax Scores
Miguel Lerma, Mirtha Lucas

TL;DR
This paper reveals a vulnerability in attribution methods for CNN classifiers, showing that small model modifications can significantly affect explanations without changing the model's output.
Contribution
It identifies a new vulnerability: minor changes to the model can manipulate attribution explanations even when the model's outputs stay the same.
Findings
Small model modifications can alter attribution explanations
Attribution methods are vulnerable without changing model outputs
Highlights a need for more robust explanation techniques
Abstract
We discuss a vulnerability involving a category of attribution methods used to provide explanations for the outputs of convolutional neural networks working as classifiers. It is known that networks of this type are vulnerable to adversarial attacks, in which imperceptible perturbations of the input may alter the outputs of the model. In contrast, here we focus on the effects that small modifications to the model may have on the attribution method without altering the model outputs.
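One way such a vulnerability can arise is sketched below. This is an illustration under stated assumptions, not the authors' construction: the abstract does not spell out the modification used. The sketch relies only on a known property of softmax, namely that adding the same quantity to every pre-softmax score leaves the output probabilities unchanged. If that quantity depends on the input, any attribution computed from the pre-softmax score of a class changes, while the classifier's outputs do not. The toy linear model `W`, the function `shift`, and the use of a plain input gradient as the attribution are all hypothetical stand-ins.

```python
import math

def softmax(z):
    # Numerically stable softmax over a list of scores.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy classifier: pre-softmax scores are z = W x (3 classes, 2 inputs).
W = [[1.0, -2.0], [0.5, 3.0], [-1.0, 0.25]]

def logits(x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def shift(x):
    # Hypothetical input-dependent term added to EVERY pre-softmax score.
    return 5.0 * x[0] - 4.0 * x[1]

def logits_modified(x):
    g = shift(x)
    return [z + g for z in logits(x)]

def score_gradient(score_fn, x, c, eps=1e-6):
    # Central finite-difference gradient of the class-c pre-softmax score
    # with respect to the input; a stand-in for a gradient-based attribution.
    grad = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += eps
        xm[i] -= eps
        grad.append((score_fn(xp)[c] - score_fn(xm)[c]) / (2 * eps))
    return grad

x = [0.3, -0.7]
c = 1  # class whose score is being explained

probs_original = softmax(logits(x))
probs_modified = softmax(logits_modified(x))
attr_original = score_gradient(logits, x, c)
attr_modified = score_gradient(logits_modified, x, c)

print("probabilities (original):", probs_original)
print("probabilities (modified):", probs_modified)
print("attribution   (original):", attr_original)
print("attribution   (modified):", attr_modified)
```

In this toy setting the two probability vectors agree up to floating-point error, while the gradient of the class score moves from the row `W[1]` to `W[1]` plus the gradient of `shift`. Attributions derived from post-softmax probabilities would not be affected by this particular modification, since those probabilities are identical.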
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Anomaly Detection Techniques and Applications
