Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
Kola Ayonrinde, Louis Jaburi

TL;DR
This paper introduces a comprehensive framework based on philosophical virtues to evaluate and improve explanations in neural network interpretability, emphasizing the importance of simplicity, unification, and universal principles.
Contribution
It presents a novel Explanatory Virtues Framework for systematically assessing explanations in mechanistic interpretability of neural networks.
Findings
Compact Proofs embody multiple explanatory virtues.
Framework suggests defining simplicity and unification as key virtues.
Universal principles for neural explanations are a promising research direction.
Abstract
Mechanistic Interpretability (MI) aims to understand neural networks through causal explanations. Though MI has many explanation-generating methods, progress has been limited by the lack of a universal approach to evaluating explanations. Here we analyse the fundamental question "What makes a good explanation?" We introduce a pluralist Explanatory Virtues Framework drawing on four perspectives from the Philosophy of Science - the Bayesian, Kuhnian, Deutschian, and Nomological - to systematically evaluate and improve explanations in MI. We find that Compact Proofs consider many explanatory virtues and are hence a promising approach. Fruitful research directions implied by our framework include (1) clearly defining explanatory simplicity, (2) focusing on unifying explanations and (3) deriving universal principles for neural networks. Improved MI methods enhance our ability to monitor,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsExplainable Artificial Intelligence (XAI)
