Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition

Dan Braun; Lucius Bushnaq; Stefan Heimersheim; Jake Mendel; Lee Sharkey

arXiv:2501.14926·cs.LG·February 11, 2025


Open Access

TL;DR

This paper introduces Attribution-based Parameter Decomposition (APD), a novel method for decomposing neural network parameters into minimal, faithful, and simple components to enhance mechanistic interpretability.

Contribution

It presents APD, an approach that decomposes neural network parameters into mechanistic components while optimizing for minimal description length, and demonstrates it on toy models.

Findings

01. Successfully identified ground truth mechanisms in toy models

02. Recovered features from superposition and separated computations

03. Provided a foundation for minimal circuit identification

Abstract

Mechanistic interpretability aims to understand the internal mechanisms learned by neural networks. Despite recent progress toward this goal, it remains unclear how best to decompose neural network parameters into mechanistic components. We introduce Attribution-based Parameter Decomposition (APD), a method that directly decomposes a neural network's parameters into components that (i) are faithful to the parameters of the original network, (ii) require a minimal number of components to process any input, and (iii) are maximally simple. Our approach thus optimizes for a minimal length description of the network's mechanisms. We demonstrate APD's effectiveness by successfully identifying ground truth mechanisms in multiple toy experimental settings: Recovering features from superposition; separating compressed computations; and identifying cross-layer distributed representations. While…
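The three criteria in the abstract can be made concrete with a toy sketch. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not spell out the loss terms, so the function names (`faithfulness_loss`, `simplicity_penalty`, `top_k_active`), the use of mean squared error for faithfulness, the singular-value penalty as a proxy for simplicity, and top-k selection by attribution magnitude as a proxy for minimality are all hypothetical choices made here for clarity.

```python
import numpy as np

def faithfulness_loss(components, target):
    """(i) Faithfulness: the components should sum to the original weights.
    Assumed form: mean squared error between the sum and the target."""
    return float(np.mean((components.sum(axis=0) - target) ** 2))

def simplicity_penalty(components, p=1.0):
    """(iii) Simplicity: assumed proxy that penalizes the singular values
    of each component, so low-rank components score as simpler."""
    total = 0.0
    for comp in components:
        singular_values = np.linalg.svd(comp, compute_uv=False)
        total += float(np.sum(singular_values ** p))
    return total

def top_k_active(attributions, k):
    """(ii) Minimality: assumed proxy that keeps only the k components
    with the largest attribution magnitude for a given input."""
    return np.argsort(-np.abs(attributions))[:k]

# Toy example: a 2x2 weight matrix split into two rank-1 components.
target = np.array([[3.0, 0.0], [0.0, 2.0]])
components = np.array([
    [[3.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
])
print(faithfulness_loss(components, target))
print(simplicity_penalty(components))
print(top_k_active(np.array([0.1, -0.9, 0.5]), k=2))
```

In this toy case the two components sum exactly to the target matrix, so the faithfulness term vanishes, and each component is rank 1, which is what the simplicity proxy rewards.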


Taxonomy

Topics: Natural Language Processing Techniques