Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?

Maxime Méloux; Silviu Maniu; François Portet; Maxime Peyrard

arXiv:2502.20914·cs.LG·March 3, 2025

TL;DR

This paper investigates whether mechanistic interpretability explanations for neural networks are unique, revealing systematic non-identifiability and discussing implications for explanation standards in AI.

Contribution

The paper introduces a formal analysis of the identifiability of MI explanations, demonstrates that multiple explanations can fit the same behavior, and proposes criteria for explanation validity.

Findings

01. Multiple circuits can replicate the same behavior
02. A circuit can have multiple valid interpretations
03. Different algorithms can align with the same network
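
The first finding can be illustrated with a toy example. The sketch below is not the paper's code or experimental setup: it assumes a hypothetical, hand-weighted two-input ReLU network (the names HIDDEN, OUTPUT, and THRESHOLD are invented for illustration) and treats a "circuit" as any subset of hidden units that, with all other units zero-ablated, still reproduces XOR on every input.

```python
from itertools import combinations, product

# Hypothetical hand-set weights (an assumption, not taken from the paper):
# four hidden ReLU units forming two redundant pairs.
HIDDEN = [  # (weight on x1, weight on x2, bias)
    (1.0, 1.0, 0.0),
    (1.0, 1.0, -1.0),
    (1.0, 1.0, 0.0),
    (1.0, 1.0, -1.0),
]
OUTPUT = [0.5, -1.0, 0.5, -1.0]  # output weight of each hidden unit
THRESHOLD = 0.25                 # the network predicts 1 above this value


def relu(z):
    return max(0.0, z)


def predict(x1, x2, kept):
    """Run the network with only the hidden units in `kept` active."""
    total = 0.0
    for i in kept:
        w1, w2, b = HIDDEN[i]
        total += OUTPUT[i] * relu(w1 * x1 + w2 * x2 + b)
    return int(total > THRESHOLD)


def computes_xor(kept):
    """True if the ablated network matches XOR on all four inputs."""
    return all(
        predict(a, b, kept) == (a ^ b)
        for a, b in product((0, 1), repeat=2)
    )


units = range(len(HIDDEN))
# Exhaustively enumerate every subset of hidden units.
circuits = [
    set(c)
    for size in range(len(HIDDEN) + 1)
    for c in combinations(units, size)
    if computes_xor(c)
]
# A circuit is minimal if no proper subset of it is also a circuit.
minimal = [c for c in circuits if not any(other < c for other in circuits)]

print(f"{len(circuits)} unit subsets reproduce XOR; {len(minimal)} are minimal")
for c in minimal:
    print(sorted(c))
```

Even in this four-unit toy, several distinct minimal subsets reproduce the target behavior perfectly, so matching the behavior alone does not single out one circuit.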

Abstract

As AI systems are used in high-stakes applications, ensuring interpretability is crucial. Mechanistic Interpretability (MI) aims to reverse-engineer neural networks by extracting human-understandable algorithms to explain their behavior. This work examines a key question: for a given behavior, and under MI's criteria, does a unique explanation exist? Drawing on identifiability in statistics, where parameters are uniquely inferred under specific assumptions, we explore the identifiability of MI explanations. We identify two main MI strategies: (1) "where-then-what," which isolates a circuit replicating model behavior before interpreting it, and (2) "what-then-where," which starts with candidate algorithms and searches for neural activation subspaces implementing them, using causal alignment. We test both strategies on Boolean functions and small multi-layer perceptrons, fully…

Code & Models

Repositories

MelouxM/MI-identifiability (PyTorch, official)

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education

Methods: ALIGN