Interpretability Needs a New Paradigm
Andreas Madsen, Himabindu Lakkaraju, Siva Reddy, Sarath Chandar

TL;DR
This paper argues that interpretability in AI needs new paradigms beyond intrinsic and post-hoc, emphasizing faithfulness and proposing three emerging paradigms to improve explanation reliability.
Contribution
The paper proposes three emerging paradigms for interpretability, focusing on designing models that enhance faithfulness and explanation quality.
Findings
The current intrinsic and post-hoc paradigms each have limitations in ensuring faithfulness.
Three emerging paradigms are proposed for better interpretability.
Evolving paradigms can improve the trustworthiness of AI explanations.
Abstract
Interpretability is the study of explaining models in understandable terms to humans. At present, interpretability is divided into two paradigms: the intrinsic paradigm, which believes that only models designed to be explained can be explained, and the post-hoc paradigm, which believes that black-box models can be explained. At the core of this debate is how each paradigm ensures its explanations are faithful, i.e., true to the model's behavior. This is important, as false but convincing explanations lead to unsupported confidence in artificial intelligence (AI), which can be dangerous. This paper's position is that we should think about new paradigms while staying vigilant regarding faithfulness. First, by examining the history of paradigms in science, we see that paradigms are constantly evolving. Then, by examining the current paradigms, we can understand their underlying beliefs,…
Taxonomy
Topics: Interpreting and Communication in Healthcare
