Do Input Gradients Highlight Discriminative Features?

Harshay Shah; Prateek Jain; Praneeth Netrapalli

arXiv:2102.12781 · cs.LG · October 27, 2021 · 1 citation

TL;DR

This paper tests whether input gradients reliably highlight discriminative features. It finds that standard models often violate this assumption while adversarially robust models satisfy it, with support from benchmark experiments, a purpose-built dataset, and theoretical analysis.

Contribution

The authors introduce DiffROAR, an evaluation framework, and BlockMNIST, a dataset designed to test interpretability assumptions, providing new tools for auditing gradient-based explanations.

Findings

01

Input gradients of standard models often violate assumption (A), the premise that input-gradient magnitude highlights discriminative, task-relevant features.

02

Adversarially robust models' input gradients satisfy assumption (A).

03

Theoretical analysis confirms empirical results on simplified datasets.

Abstract

Post-hoc gradient-based interpretability methods [Simonyan et al., 2013; Smilkov et al., 2017] that provide instance-specific explanations of model predictions are often based on assumption (A): the magnitude of input gradients (gradients of logits with respect to the input) noisily highlights discriminative, task-relevant features. In this work, we test the validity of assumption (A) using a three-pronged approach. First, we develop an evaluation framework, DiffROAR, to test assumption (A) on four image classification benchmarks. Our results suggest that (i) input gradients of standard models (i.e., trained on original data) may grossly violate (A), whereas (ii) input gradients of adversarially robust models satisfy (A). Second, we introduce BlockMNIST, an MNIST-based semi-real dataset that by design encodes a priori knowledge of discriminative features. Our analysis on BlockMNIST leverages…
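To make the quantity in assumption (A) concrete, here is a minimal sketch of an input gradient: the gradient of one logit with respect to each input coordinate, whose magnitude is then used to rank coordinates. The toy one-hidden-layer ReLU network, its weights, and the helper names (`logits`, `input_gradient`, `unmask_by_rank`) are all illustrative assumptions. This is not the authors' code, and in practice an autograd framework would compute the gradient.

```python
def logits(W1, b1, W2, b2, x):
    """Forward pass of a toy one-hidden-layer ReLU network (hypothetical model).

    Returns the logits and the hidden pre-activations.
    """
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    hidden = [max(0.0, p) for p in pre]
    out = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]
    return out, pre


def input_gradient(W1, b1, W2, b2, x, label):
    """Gradient of the logit for `label` with respect to each input coordinate."""
    _, pre = logits(W1, b1, W2, b2, x)
    # Chain rule: d logit / d x_j = sum_i W2[label][i] * 1[pre_i > 0] * W1[i][j]
    return [
        sum(W2[label][i] * (1.0 if pre[i] > 0 else 0.0) * W1[i][j]
            for i in range(len(W1)))
        for j in range(len(x))
    ]


def unmask_by_rank(x, grad, k, keep="top"):
    """Keep the k coordinates with the largest (or smallest) |gradient|; zero the rest."""
    order = sorted(range(len(x)), key=lambda j: abs(grad[j]), reverse=(keep == "top"))
    kept = set(order[:k])
    return [xi if j in kept else 0.0 for j, xi in enumerate(x)]


# Illustrative weights: coordinates 0 and 1 drive the logits, 2 and 3 barely matter.
W1 = [[2.0, 0.0, 0.1, 0.0], [0.0, -1.5, 0.0, 0.05]]
b1 = [0.0, 0.0]
W2 = [[1.0, -1.0], [-1.0, 1.0]]
b2 = [0.0, 0.0]
x = [1.0, -1.0, 1.0, 1.0]

grad = input_gradient(W1, b1, W2, b2, x, label=0)
top = unmask_by_rank(x, grad, k=2, keep="top")        # highest-magnitude coordinates kept
bottom = unmask_by_rank(x, grad, k=2, keep="bottom")  # lowest-magnitude coordinates kept
```

If assumption (A) holds, the `top` variant should retain more of the information needed for prediction than the `bottom` variant. A ROAR-style evaluation would, roughly, retrain models on the two masked variants of a dataset and compare their accuracy; that retraining step is omitted here.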

Code & Models

Repositories

harshays/inputgradients
PyTorch · Official

Videos

Do Input Gradients Highlight Discriminative Features? · SlidesLive

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning and Data Classification