When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability
ML Nissen Gonzalez, Melwina Albuquerque, Laurence Wroe, Jacob Meyer Cohen, Logan Riggs Smith, Thomas Dooms

TL;DR
The paper introduces tensor similarity, a weight-based metric for tensor-based models that is invariant to weight-space symmetries and captures global functional equivalence between networks.
Contribution
It proposes a weight-based tensor similarity measure that addresses two limitations of existing metrics: behavioural measures are blind to out-of-distribution mechanisms, and parameter-based measures are basis-dependent and so disregard weight-space symmetries.
Findings
Tensor similarity tracks functional training dynamics, such as grokking and backdoor insertion, with higher fidelity than existing metrics.
The metric reduces measuring similarity and verifying faithfulness to an algebraic problem.
An efficient recursive algorithm lets it account for cross-layer mechanisms.
Abstract
Mechanistic interpretability aims to break models into meaningful parts; verifying that two such parts implement the same computation is a prerequisite. Existing similarity measures evaluate either empirical behaviour, leaving them blind to out-of-distribution mechanisms, or basis-dependent parameters, meaning they disregard weight-space symmetries. To address these issues for the class of tensor-based models, we introduce a weight-based metric, tensor similarity, that is invariant to such symmetries. This metric captures global functional equivalence and accounts for cross-layer mechanisms using an efficient recursive algorithm. Empirically, tensor similarity tracks functional training dynamics, such as grokking and backdoor insertion, with higher fidelity than existing metrics. This reduces measuring similarity and verifying faithfulness into a solved algebraic problem rather than one…
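To make the weight-space symmetry issue concrete, here is a minimal sketch in Python. It is not the paper's recursive algorithm. It assumes a single bilinear layer y = D((Ax) ⊙ (Bx)) as a stand-in for a tensor-based model; the names `bilinear_tensor` and `cosine` are hypothetical helpers introduced only for this illustration. Permuting and rescaling the hidden units changes the raw weights but not the function, so a comparison of raw parameters reports a difference while a comparison of the contracted tensor does not.

```python
import numpy as np

def bilinear_tensor(A, B, D):
    """Contract the hidden dimension: T[o, i, j] = sum_h D[o, h] * A[h, i] * B[h, j]."""
    T = np.einsum("oh,hi,hj->oij", D, A, B)
    # x_i * x_j == x_j * x_i, so only the part symmetric in (i, j) affects the output.
    return 0.5 * (T + T.transpose(0, 2, 1))

def cosine(u, v):
    """Cosine similarity between two arrays of the same shape (flattened)."""
    return float(np.vdot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 6, 3  # illustrative sizes, chosen arbitrarily
A = rng.normal(size=(d_hidden, d_in))
B = rng.normal(size=(d_hidden, d_in))
D = rng.normal(size=(d_out, d_hidden))

# Apply a weight-space symmetry: permute hidden units and rescale them,
# compensating for the scale in the output matrix.
perm = rng.permutation(d_hidden)
scale = rng.uniform(0.5, 2.0, size=d_hidden)
A2 = (A * scale[:, None])[perm]
B2 = B[perm]
D2 = (D / scale[None, :])[:, perm]

# Both parameter sets compute the same function on any input.
x = rng.normal(size=d_in)
y1 = D @ ((A @ x) * (B @ x))
y2 = D2 @ ((A2 @ x) * (B2 @ x))

# Basis-dependent comparison: cosine similarity of the raw, concatenated weights.
naive_sim = cosine(
    np.concatenate([A.ravel(), B.ravel(), D.ravel()]),
    np.concatenate([A2.ravel(), B2.ravel(), D2.ravel()]),
)

# Symmetry-invariant comparison: cosine similarity of the contracted tensors.
tensor_sim = cosine(bilinear_tensor(A, B, D), bilinear_tensor(A2, B2, D2))
```

In this toy setting `tensor_sim` equals 1 up to floating-point error, while `naive_sim` falls below it even though the two parameter sets are functionally identical. Materialising the full tensor is only practical for small layers; the abstract attributes the metric's efficiency and its handling of cross-layer mechanisms to a recursive algorithm, which this sketch does not attempt to reproduce.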
