Validating Mechanistic Interpretations: An Axiomatic Approach

Nils Palumbo; Ravi Mangal; Zifan Wang; Saranya Vijayakumar; Corina S. Pasareanu; Somesh Jha

arXiv:2407.13594 · cs.LG · June 24, 2025 · 1 citation

Open Access

TL;DR

This paper introduces an axiomatic framework for validating mechanistic interpretations of neural networks, ensuring they accurately and compositionally approximate the network's semantics, demonstrated through case studies including a Transformer model.

Contribution

It formalizes mechanistic interpretability using axioms inspired by abstract interpretation, providing a rigorous validation method for interpretability studies.

Findings

01 The axioms are used to validate the mechanistic interpretation from an existing, well-known interpretability study

02 The framework is also applied to a new case study on a Transformer-based model

03 That Transformer-based model is trained to solve the 2-SAT problem

Abstract

Mechanistic interpretability aims to reverse engineer the computation performed by a neural network in terms of its internal components. Although there is a growing body of research on mechanistic interpretation of neural networks, the notion of a mechanistic interpretation itself is often ad hoc. Inspired by abstract interpretation, a technique from the program analysis literature for developing approximate semantics for programs, we give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms for validating mechanistic interpretations on an existing, well-known interpretability study as well as on a new case study involving a Transformer-based model trained to solve the well-known 2-SAT…
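
To make the idea of an interpretation that "approximately captures the semantics ... in a compositional manner" more concrete, the following is a minimal toy sketch. It is not taken from the paper and does not reproduce its axioms. Every name in it (`net_layer1`, `interp_step1`, `abstract`, `concretize`, the parity task, the planted error on input 7) is a hypothetical chosen for illustration. It assumes a network split into two components, a human-readable two-step interpretation, and hand-written maps between hidden activations and abstract states, and it measures how often the two agree, both component by component and after swapping an interpreted step into the network.

```python
from typing import Callable, Iterable, Tuple

Hidden = Tuple[float, float]


def agreement(f: Callable, g: Callable, inputs: Iterable) -> float:
    """Fraction of inputs on which f and g return the same value."""
    xs = list(inputs)
    return sum(f(x) == g(x) for x in xs) / len(xs)


# --- Toy "network": two components composed (hypothetical) ---
def net_layer1(x: int) -> Hidden:
    if x == 7:  # deliberately planted error so agreement is not perfect
        return (0.6, 0.4)
    return (1.0, 0.1) if x % 2 == 0 else (0.2, 0.9)


def net_layer2(h: Hidden) -> int:
    return int(h[1] > h[0])


def network(x: int) -> int:
    return net_layer2(net_layer1(x))


# --- Human-readable interpretation with the same two-step structure ---
def interp_step1(x: int) -> str:
    return "even" if x % 2 == 0 else "odd"


def interp_step2(state: str) -> int:
    return 0 if state == "even" else 1


def interpretation(x: int) -> int:
    return interp_step2(interp_step1(x))


# --- Maps between hidden activations and abstract states (assumed) ---
def abstract(h: Hidden) -> str:
    return "even" if h[0] >= h[1] else "odd"


def concretize(state: str) -> Hidden:
    return (1.0, 0.0) if state == "even" else (0.0, 1.0)


inputs = range(20)
report = {
    # End to end: does the interpretation predict the network's output?
    "end_to_end": agreement(network, interpretation, inputs),
    # First component: does the abstracted hidden state match step 1?
    "step1": agreement(lambda x: abstract(net_layer1(x)), interp_step1, inputs),
    # Second component, evaluated on hidden states the network produces.
    "step2": agreement(
        lambda x: interp_step2(abstract(net_layer1(x))), network, inputs
    ),
    # Swap step 1 of the interpretation into the network and compare.
    "swap_step1": agreement(
        lambda x: net_layer2(concretize(interp_step1(x))), network, inputs
    ),
}

EPSILON = 0.1  # tolerated disagreement rate (an arbitrary choice)
for name, score in report.items():
    print(f"{name}: {score:.2f} (ok={score >= 1 - EPSILON})")
```

Because of the planted error on input 7, three of the four checks agree on 19 of 20 inputs, while the second-component check agrees on all of them. The point of checking components separately, rather than only end to end, is that it localizes where the interpretation and the network diverge.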

Videos

Validating Mechanistic Interpretations: An Axiomatic Approach · SlidesLive

Taxonomy

Topics: Model-Driven Software Engineering Techniques · Formal Methods in Verification

Methods: Sparse Evolutionary Training