A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i

Kola Ayonrinde; Louis Jaburi

arXiv:2505.00808 · cs.LG · May 5, 2025

Open Access

TL;DR

This paper argues that Mechanistic Interpretability is a principled approach to understanding neural networks through causal explanations. It defines the field and its criteria, and discusses its limits and foundational principles.

Contribution

It introduces the Explanatory View Hypothesis, proposes a definition of Mechanistic Interpretability, and formulates the Principle of Explanatory Optimism, a conjecture the authors argue is a necessary precondition for the field's success.

Findings

1. Explanatory Faithfulness is well-defined.
2. Mechanistic Interpretability involves causal, model-level explanations.
3. The limits of MI are characterized and discussed.

Abstract

Mechanistic Interpretability aims to understand neural networks through causal explanations. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a principled approach to understanding models because neural networks contain implicit explanations which can be extracted and understood. We hence show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined. We propose a definition of Mechanistic Interpretability (MI) as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks, allowing us to distinguish MI from other interpretability paradigms and detail MI's inherent limits. We formulate the Principle of Explanatory Optimism, a conjecture which we argue is a necessary precondition for the success of Mechanistic Interpretability.

Taxonomy

Topics: Philosophy and History of Science · Statistical and Computational Modeling