Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?
Maxime Méloux, Silviu Maniu, François Portet, Maxime Peyrard

TL;DR
This paper investigates whether mechanistic interpretability explanations for neural networks are unique, revealing systematic non-identifiability and discussing implications for explanation standards in AI.
Contribution
It introduces a formal analysis of the identifiability of MI explanations, demonstrating that multiple explanations can fit the same behavior, and proposes criteria for explanation validity.
Findings
Multiple circuits can replicate the same behavior
A circuit can have multiple valid interpretations
Different algorithms can align with the same network
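The first finding can be illustrated with a toy example. The sketch below is not the paper's procedure (the paper works with small multi-layer perceptrons and circuit search); it only assumes a hypothetical two-layer Boolean circuit, with gates picked from a small set, and enumerates every gate assignment whose truth table equals XOR. The names GATES, truth_table, and circuits_matching are made up for this illustration.

```python
from itertools import product

# Hypothetical gate set for this illustration (not taken from the paper).
GATES = {
    "AND": lambda x, y: x & y,
    "OR": lambda x, y: x | y,
    "NAND": lambda x, y: 1 - (x & y),
    "NOR": lambda x, y: 1 - (x | y),
}

# All input pairs (a, b) in a fixed order, and the target behavior: XOR.
INPUTS = list(product([0, 1], repeat=2))
XOR_TABLE = tuple(a ^ b for a, b in INPUTS)


def truth_table(g1, g2, g3):
    """Truth table of the circuit out = g3(g1(a, b), g2(a, b))."""
    return tuple(
        GATES[g3](GATES[g1](a, b), GATES[g2](a, b)) for a, b in INPUTS
    )


def circuits_matching(target):
    """Return every (g1, g2, g3) assignment whose truth table equals target."""
    return [c for c in product(GATES, repeat=3) if truth_table(*c) == target]


xor_circuits = circuits_matching(XOR_TABLE)
for g1, g2, g3 in xor_circuits:
    print(f"out = {g3}({g1}(a, b), {g2}(a, b))")
```

Several structurally different assignments are printed, for example AND(OR(a, b), NAND(a, b)) and NOR(AND(a, b), NOR(a, b)). Each matches XOR on every input, so input-output behavior alone cannot single out one of them as "the" explanation.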
Abstract
As AI systems are used in high-stakes applications, ensuring interpretability is crucial. Mechanistic Interpretability (MI) aims to reverse-engineer neural networks by extracting human-understandable algorithms to explain their behavior. This work examines a key question: for a given behavior, and under MI's criteria, does a unique explanation exist? Drawing on identifiability in statistics, where parameters are uniquely inferred under specific assumptions, we explore the identifiability of MI explanations. We identify two main MI strategies: (1) "where-then-what," which isolates a circuit replicating model behavior before interpreting it, and (2) "what-then-where," which starts with candidate algorithms and searches for neural activation subspaces implementing them, using causal alignment. We test both strategies on Boolean functions and small multi-layer perceptrons, fully…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Artificial Intelligence in Healthcare and Education
Methods: ALIGN
