Do Input Gradients Highlight Discriminative Features?
Harshay Shah, Prateek Jain, Praneeth Netrapalli

TL;DR
This paper critically evaluates whether input gradients reliably highlight discriminative features, revealing that standard models often violate this assumption while adversarially robust models satisfy it, supported by empirical, dataset-based, and theoretical analyses.
Contribution
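The authors introduce two tools for auditing gradient-based explanations: DiffROAR, an evaluation framework that tests assumption (A) on four image classification benchmarks, and BlockMNIST, an MNIST-based semi-real dataset that by design encodes a priori knowledge of discriminative features.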
Findings
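Input gradients of standard models (i.e., trained on original data) may grossly violate assumption (A), which holds that the magnitude of input gradients highlights discriminative, task-relevant features.
Input gradients of adversarially robust models satisfy assumption (A).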
Theoretical analysis shows that the empirical findings also hold on a simplified version of BlockMNIST.
Abstract
Post-hoc gradient-based interpretability methods [Simonyan et al., 2013, Smilkov et al., 2017] that provide instance-specific explanations of model predictions are often based on assumption (A): magnitude of input gradients -- gradients of logits with respect to input -- noisily highlight discriminative task-relevant features. In this work, we test the validity of assumption (A) using a three-pronged approach. First, we develop an evaluation framework, DiffROAR, to test assumption (A) on four image classification benchmarks. Our results suggest that (i) input gradients of standard models (i.e., trained on original data) may grossly violate (A), whereas (ii) input gradients of adversarially robust models satisfy (A). Second, we introduce BlockMNIST, an MNIST-based semi-real dataset, that by design encodes a priori knowledge of discriminative features. Our analysis on BlockMNIST leverages…
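The quantity at the center of assumption (A), the gradient of a logit with respect to the input, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy linear classifier (`WEIGHTS`, `BIASES`), the helper names, and the use of central finite differences in place of automatic differentiation. It only shows how features can be ranked by input-gradient magnitude; it does not implement DiffROAR or BlockMNIST.

```python
from typing import Callable, List, Sequence


def input_gradient(logit_fn: Callable[[Sequence[float]], float],
                   x: Sequence[float],
                   eps: float = 1e-5) -> List[float]:
    """Estimate d(logit)/d(x) with central finite differences."""
    grad = []
    for i in range(len(x)):
        x_hi = list(x)
        x_lo = list(x)
        x_hi[i] += eps
        x_lo[i] -= eps
        grad.append((logit_fn(x_hi) - logit_fn(x_lo)) / (2 * eps))
    return grad


def top_k_by_magnitude(grad: Sequence[float], k: int) -> List[int]:
    """Return indices of the k features with the largest |gradient|."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return order[:k]


# Hypothetical toy linear classifier: 2 classes, 4 input features.
WEIGHTS = [[0.1, -2.0, 0.0, 0.5],
           [0.0, 1.5, 0.2, -0.3]]
BIASES = [0.0, 0.1]


def make_logit_fn(class_index: int) -> Callable[[Sequence[float]], float]:
    """Build a function that returns the logit of one class for an input x."""
    def logit(x: Sequence[float]) -> float:
        w = WEIGHTS[class_index]
        return sum(wi * xi for wi, xi in zip(w, x)) + BIASES[class_index]
    return logit


x = [0.2, 0.7, 0.1, 0.9]                    # illustrative input
grad = input_gradient(make_logit_fn(0), x)  # gradient of the class-0 logit
ranking = top_k_by_magnitude(grad, k=2)     # features ranked by |gradient|
```

For a linear model the input gradient equals the weight row of the chosen class, so the ranking here simply reflects the weight magnitudes. Assumption (A) is the claim that, for trained deep networks, the analogous ranking picks out discriminative features, which is what the paper puts to the test.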
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Machine Learning and Data Classification
