Towards falsifiable interpretability research
Matthew L. Leavitt, Ari Morcos

TL;DR
This paper critiques current interpretability methods for deep neural networks, highlighting their reliance on intuition and lack of falsifiability, and proposes a framework for more robust, evidence-based interpretability research.
Contribution
It introduces a framework for falsifiable interpretability research, encouraging hypothesis-driven methods to improve robustness and validity in understanding DNNs.
Findings
Current interpretability methods often rely on intuition.
Falsifiability can improve the robustness of interpretability research.
The proposed framework promotes evidence-based insights.
Abstract
Methods for understanding the decisions of and mechanisms underlying deep neural networks (DNNs) typically rely on building intuition by emphasizing sensory or semantic features of individual examples. For instance, methods aim to visualize the components of an input which are "important" to a network's decision, or to measure the semantic properties of single neurons. Here, we argue that interpretability research suffers from an over-reliance on intuition-based approaches that risk (and in some cases have caused) illusory progress and misleading conclusions. We identify a set of limitations that we argue impede meaningful progress in interpretability research, and examine two popular classes of interpretability methods (saliency and single-neuron-based approaches) that serve as case studies for how over-reliance on intuition and lack of falsifiability can undermine interpretability…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Anomaly Detection Techniques and Applications
Methods: Interpretability
