Validating Mechanistic Interpretations: An Axiomatic Approach
Nils Palumbo, Ravi Mangal, Zifan Wang, Saranya Vijayakumar, Corina S. Pasareanu, Somesh Jha

TL;DR
This paper introduces an axiomatic framework for validating mechanistic interpretations of neural networks. The axioms require an interpretation to approximate the network's semantics accurately and compositionally, and the framework is demonstrated through case studies, including one on a Transformer model.
Contribution
It formalizes mechanistic interpretability using axioms inspired by abstract interpretation, providing a rigorous validation method for interpretability studies.
Findings
The axioms are used to validate the interpretation from an existing, well-known interpretability study
The framework is also applied to a new case study on a Transformer-based model
That model is trained to solve the 2-SAT problem
Abstract
Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad-hoc. Inspired by the notion of abstract interpretation from the program analysis literature that aims to develop approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms for validating mechanistic interpretations on an existing, well-known interpretability study as well as on a new case study involving a Transformer-based model trained to solve the well-known 2-SAT…
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Formal Methods in Verification
Methods: Sparse Evolutionary Training
