Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition
Dan Braun, Lucius Bushnaq, Stefan Heimersheim, Jake Mendel, Lee Sharkey

TL;DR
This paper introduces Attribution-based Parameter Decomposition (APD), a novel method for decomposing neural network parameters into minimal, faithful, and simple components to enhance mechanistic interpretability.
Contribution
It presents APD, a new approach that decomposes neural network parameters into mechanistic components by optimizing for minimal description length, validated on toy models.
Findings
Successfully identified ground truth mechanisms in toy models
Recovered features from superposition, separated compressed computations, and identified cross-layer distributed representations
Provided a foundation for minimal circuit identification
Abstract
Mechanistic interpretability aims to understand the internal mechanisms learned by neural networks. Despite recent progress toward this goal, it remains unclear how best to decompose neural network parameters into mechanistic components. We introduce Attribution-based Parameter Decomposition (APD), a method that directly decomposes a neural network's parameters into components that (i) are faithful to the parameters of the original network, (ii) require a minimal number of components to process any input, and (iii) are maximally simple. Our approach thus optimizes for a minimal length description of the network's mechanisms. We demonstrate APD's effectiveness by successfully identifying ground truth mechanisms in multiple toy experimental settings: Recovering features from superposition; separating compressed computations; and identifying cross-layer distributed representations. While…
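To make properties (i) and (ii) from the abstract concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the mean-squared-error measure of faithfulness, and the top-k selection rule based on attribution scores are all hypothetical choices made for this example, and the attribution scores are hand-written rather than computed from a model. Property (iii), simplicity, is not sketched here.

```python
# Illustrative sketch only. Names, the MSE measure, and the top-k rule are
# assumptions for this example, not taken from the paper's code.

def sum_components(components):
    """Elementwise sum of equally shaped matrices (lists of lists)."""
    rows, cols = len(components[0]), len(components[0][0])
    return [[sum(c[i][j] for c in components) for j in range(cols)]
            for i in range(rows)]

def faithfulness_error(components, target):
    """Mean squared difference between the summed components and the target."""
    total = sum_components(components)
    diffs = [(total[i][j] - target[i][j]) ** 2
             for i in range(len(target)) for j in range(len(target[0]))]
    return sum(diffs) / len(diffs)

def top_k_active(attributions, k):
    """Indices of the k components with the largest absolute attribution."""
    ranked = sorted(range(len(attributions)),
                    key=lambda i: abs(attributions[i]), reverse=True)
    return sorted(ranked[:k])

# A toy 2x2 "network" parameter matrix and a hypothetical decomposition of it.
target = [[1.0, 0.0], [0.0, 2.0]]
components = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]

# Hypothetical per-input attribution scores, one per component.
attributions = [0.9, 0.05, 0.0]

print(faithfulness_error(components, target))   # components sum exactly to target
active = top_k_active(attributions, k=1)
print(active)                                   # only the most attributed component
print(sum_components([components[i] for i in active]))
```

In this toy setup the components sum exactly to the target matrix, so the faithfulness error is zero, and for the given input only one component is selected as active. A real decomposition would learn the components by optimization and compute attributions from the network itself.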
Taxonomy
Topics: Natural Language Processing Techniques
