The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hänni, Avery Griffin, Jörn Stöhler, Magdalena Wache, Marius Hobbhahn

TL;DR
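The paper introduces the Local Interaction Basis (LIB), an interpretability method that transforms a network's activations into a basis aligned with the singular vectors of the Jacobian between adjacent layers, dropping irrelevant directions along the way. LIB finds more relevant features and sparser interactions than PCA on modular addition and CIFAR-10 models, but does not significantly improve interpretability in language models.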
Contribution
LIB is a new basis transformation method that isolates computational features and interactions in neural networks, addressing limitations of existing interpretability techniques.
Findings
LIB identifies more relevant features than PCA in tested models.
LIB reveals sparser interactions in modular addition and CIFAR-10 models.
LIB does not significantly improve interpretability in language models.
Abstract
Mechanistic interpretability aims to understand the behavior of neural networks by reverse-engineering their internal computations. However, current methods struggle to find clear interpretations of neural network activations because a decomposition of activations into computational features is missing. Individual neurons or model components do not cleanly correspond to distinct features or functions. We present a novel interpretability method that aims to overcome this limitation by transforming the activations of the network into a new basis - the Local Interaction Basis (LIB). LIB aims to identify computational features by removing irrelevant activations and interactions. Our method drops irrelevant activation directions and aligns the basis with the singular vectors of the Jacobian matrix between adjacent layers. It also scales features based on their importance for downstream…
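The two steps named in the abstract (dropping irrelevant activation directions, then aligning the basis with the Jacobian's singular vectors) can be illustrated with a minimal NumPy sketch. This is a deliberate simplification and not the authors' implementation: it assumes a single linear layer, so the Jacobian is a constant matrix, and it omits the importance-based scaling the abstract mentions. The function name `toy_interaction_basis` and the parameter `var_tol` are hypothetical.

```python
import numpy as np

def toy_interaction_basis(acts, jacobian, var_tol=1e-8):
    """Toy sketch (assumption: the next layer is linear, so its Jacobian is constant).

    acts:     (n_samples, d_in) activations at one layer
    jacobian: (d_out, d_in) Jacobian of the next layer w.r.t. these activations
    Returns a (d_in, r) matrix with orthonormal columns, and r singular values.
    """
    # Step 1: PCA on the activations; drop directions with negligible variance.
    centered = acts - acts.mean(axis=0)
    cov = centered.T @ centered / len(acts)
    eigvals, eigvecs = np.linalg.eigh(cov)
    keep = eigvals > var_tol * eigvals.max()
    pca_basis = eigvecs[:, keep]                 # (d_in, k)

    # Step 2: express the Jacobian in the reduced basis and take its SVD.
    jac_reduced = jacobian @ pca_basis           # (d_out, k)
    _, sing_vals, vt = np.linalg.svd(jac_reduced, full_matrices=False)

    # Rotate the kept directions onto the right singular vectors.
    basis = pca_basis @ vt.T                     # (d_in, min(d_out, k))
    return basis, sing_vals

# Usage on synthetic data: 4-dimensional activations that only span 3 dimensions.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 4))
jacobian = rng.normal(size=(5, 4))               # stand-in for a linear layer's weights
basis, sing_vals = toy_interaction_basis(acts, jacobian)
```

In this toy setting, the new basis directions map to mutually orthogonal directions in the next layer, and each singular value measures how strongly that direction affects the next layer's output.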
Taxonomy
Topics: Neural Networks and Applications
