A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i
Kola Ayonrinde, Louis Jaburi

TL;DR
This paper argues that Mechanistic Interpretability is a principled approach to understanding neural networks through causal explanations. It defines the field and its criteria, and discusses its limits and foundational principles.
Contribution
It introduces the Explanatory View Hypothesis, proposes a formal definition of Mechanistic Interpretability, and formulates the Principle of Explanatory Optimism, a conjecture argued to be a necessary precondition for the field's success.
Findings
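Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined.
Mechanistic Interpretability is the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks.
This definition distinguishes MI from other interpretability paradigms and makes MI's inherent limits explicit.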
Abstract
Mechanistic Interpretability aims to understand neural networks through causal explanations. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a principled approach to understanding models because neural networks contain implicit explanations which can be extracted and understood. We hence show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined. We propose a definition of Mechanistic Interpretability (MI) as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks, allowing us to distinguish MI from other interpretability paradigms and detail MI's inherent limits. We formulate the Principle of Explanatory Optimism, a conjecture which we argue is a necessary precondition for the success of Mechanistic Interpretability.
Taxonomy
Topics: Philosophy and History of Science · Statistical and Computational Modeling
