Tracking Equivalent Mechanistic Interpretations Across Neural Networks

Alan Sun; Mariya Toneva

arXiv:2603.30002·cs.LG·April 1, 2026

TL;DR

This paper introduces a formal framework for deciding whether two different neural network models share a common mechanistic interpretation, without requiring an explicit description of that interpretation.

Contribution

The paper formalizes interpretive equivalence, develops an algorithm to estimate it, and provides theoretical guarantees relating interpretations, circuits, and representations.

Findings

01. Proposed a formal definition of interpretive equivalence.

02. Developed an algorithm to estimate interpretive equivalence.

03. Provided theoretical conditions linking representations and interpretations.

Abstract

Mechanistic interpretability (MI) is an emerging framework for interpreting neural networks. Given a task and model, MI aims to discover a succinct algorithmic process (an interpretation) that explains the model's decision process on that task. However, MI is difficult to scale and generalize. This stems in part from two key challenges: there is no precise notion of a valid interpretation, and generating interpretations is often an ad hoc process. In this paper, we address these challenges by defining and studying the problem of interpretive equivalence: determining whether two different models share a common interpretation, without requiring an explicit description of what that interpretation is. At the core of our approach, we propose and formalize the principle that two interpretations of a model are equivalent if all of their possible implementations are also equivalent. We develop…

Videos

Tracking Equivalent Mechanistic Interpretations Across Neural Networks · SlidesLive