A Benchmark for Interpretability Methods in Deep Neural Networks
Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, Been Kim

TL;DR
This paper introduces an empirical benchmark for evaluating the accuracy of feature importance estimates in deep neural networks. It finds that many popular methods do no better than a random assignment of importance, and that only certain ensemble approaches improve on that baseline.
Contribution
It provides a systematic benchmark for interpretability methods and highlights the effectiveness of specific ensemble techniques like VarGrad and SmoothGrad-Squared.
Findings
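Many popular interpretability methods estimate feature importance no better than a random assignment.
Only certain ensemble approaches, VarGrad and SmoothGrad-Squared, outperform the random baseline.
Some ensemble approaches do no better than the underlying method while carrying a far higher computational cost.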
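To make the ensemble estimators named in this summary concrete, here is a minimal NumPy sketch. It assumes the commonly used definitions, which this page does not spell out: SmoothGrad averages gradients over noisy copies of the input, SmoothGrad-Squared averages the squared gradients, and VarGrad takes their variance. The names `ensemble_saliency`, `grad_fn`, `toy_grad`, and the noise settings are hypothetical choices for illustration, not the authors' code.

```python
import numpy as np


def ensemble_saliency(grad_fn, x, n_samples=15, noise_std=0.15, seed=0):
    """Return (SmoothGrad, SmoothGrad-Squared, VarGrad) maps for input x.

    grad_fn is a hypothetical callable that maps an input array to a
    gradient array of the same shape (the base saliency estimate).
    """
    rng = np.random.default_rng(seed)
    # Assumption: noise scale is a fraction of the input's value range.
    sigma = noise_std * (x.max() - x.min())
    # One base-method gradient per noisy copy of the input.
    grads = np.stack([
        grad_fn(x + rng.normal(0.0, sigma, size=x.shape))
        for _ in range(n_samples)
    ])
    smoothgrad = grads.mean(axis=0)                 # mean of gradients
    smoothgrad_squared = (grads ** 2).mean(axis=0)  # mean of squared gradients
    vargrad = grads.var(axis=0)                     # population variance of gradients
    return smoothgrad, smoothgrad_squared, vargrad


# Toy stand-in for a network: f(x) = sum(W * x**2), so df/dx = 2 * W * x.
W = np.array([1.0, -2.0, 0.5, 3.0])


def toy_grad(x):
    return 2.0 * W * x


x = np.array([0.2, 0.4, 0.6, 0.8])
sg, sg_sq, vg = ensemble_saliency(toy_grad, x)
```

Under these definitions all three maps come from the same `n_samples` gradient evaluations, so each costs roughly `n_samples` times the base method. With the population variance used here, VarGrad equals SmoothGrad-Squared minus the square of SmoothGrad, which makes explicit that the estimators differ only in how the same noisy gradients are aggregated.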
Abstract
We propose an empirical measure of the approximate accuracy of feature importance estimates in deep neural networks. Our results across several large-scale image classification datasets show that many popular interpretability methods produce estimates of feature importance that are not better than a random designation of feature importance. Only certain ensemble-based approaches, VarGrad and SmoothGrad-Squared, outperform such a random assignment of importance. The manner of ensembling remains critical: we show that some approaches do no better than the underlying method but carry a far higher computational burden.
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Natural Language Processing Techniques
Methods: Interpretability
