A Vulnerability of Attribution Methods Using Pre-Softmax Scores

Miguel Lerma; Mirtha Lucas

arXiv:2307.03305 · cs.LG · April 10, 2024

TL;DR

This paper reveals a vulnerability in attribution methods for CNN classifiers, showing that small model modifications can significantly affect explanations without changing the model's output.

Contribution

The paper identifies a new vulnerability in which minor changes to the model manipulate attribution explanations while leaving the model's outputs unchanged.

Findings

01. Small model modifications can alter attribution explanations
02. Attribution explanations can be manipulated without changing model outputs
03. More robust explanation techniques are needed

Abstract

We discuss a vulnerability involving a category of attribution methods used to provide explanations for the outputs of convolutional neural networks working as classifiers. Networks of this type are known to be vulnerable to adversarial attacks, in which imperceptible perturbations of the input may alter the outputs of the model. In contrast, here we focus on the effects that small modifications of the model may have on the attribution method without altering the model outputs.
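One way to see why attributions computed from pre-softmax scores can be decoupled from the model's outputs is that softmax is invariant to adding the same value to every score. The sketch below illustrates this on a toy linear classifier and is not necessarily the construction used in the paper: the model, the vector `v`, and all variable names are assumptions made for illustration. Adding the input-dependent quantity `v @ x` to every pre-softmax score leaves the probabilities unchanged but shifts the gradient of each score with respect to the input by `v`.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector of scores.
    e = np.exp(z - np.max(z))
    return e / e.sum()

rng = np.random.default_rng(0)
n_classes, n_features = 3, 5

# Hypothetical toy classifier: pre-softmax scores z = W @ x + b.
W = rng.normal(size=(n_classes, n_features))
b = rng.normal(size=n_classes)
x = rng.normal(size=n_features)

# Hypothetical modification: add the same term v @ x to every score.
# This is equivalent to adding v to every row of W.
v = 10.0 * rng.normal(size=n_features)
W_mod = W + v  # broadcasting adds v to each row

probs = softmax(W @ x + b)
probs_mod = softmax(W_mod @ x + b)

# Gradient attribution of the winning pre-softmax score w.r.t. the input.
# For a linear model this gradient is simply the corresponding row of W.
c = int(np.argmax(probs))
grad = W[c]
grad_mod = W_mod[c]

outputs_match = np.allclose(probs, probs_mod)
attribution_shift = np.linalg.norm(grad_mod - grad)

print("same probabilities:", outputs_match)
print("norm of attribution change:", attribution_shift)
```

In this toy setting the attribution changes by exactly `v`, which can be chosen freely, while the post-softmax output is identical up to floating-point error.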

Code & Models

Repositories

mlerma54/adversarial-attacks-on-saliency-maps (Official)

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Anomaly Detection Techniques and Applications
