Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii

Kola Ayonrinde; Louis Jaburi

arXiv:2505.01372·cs.LG·May 5, 2025

TL;DR

This paper introduces a pluralist framework of explanatory virtues, drawn from the philosophy of science, for evaluating and improving explanations in neural network interpretability. It highlights explanatory simplicity, unifying explanations, and universal principles as research directions.

Contribution

It presents a novel Explanatory Virtues Framework for systematically assessing explanations in mechanistic interpretability of neural networks.

Findings

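01 Compact Proofs take many explanatory virtues into account and are therefore a promising approach.

02 The framework points to clearly defining explanatory simplicity and focusing on unifying explanations as research directions.

03 Deriving universal principles for neural networks is a further promising research direction.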

Abstract

Mechanistic Interpretability (MI) aims to understand neural networks through causal explanations. Though MI has many explanation-generating methods, progress has been limited by the lack of a universal approach to evaluating explanations. Here we analyse the fundamental question "What makes a good explanation?" We introduce a pluralist Explanatory Virtues Framework drawing on four perspectives from the Philosophy of Science - the Bayesian, Kuhnian, Deutschian, and Nomological - to systematically evaluate and improve explanations in MI. We find that Compact Proofs consider many explanatory virtues and are hence a promising approach. Fruitful research directions implied by our framework include (1) clearly defining explanatory simplicity, (2) focusing on unifying explanations and (3) deriving universal principles for neural networks. Improved MI methods enhance our ability to monitor,…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI)