The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Lucius Bushnaq, Stefan Heimersheim, Nicholas Goldowsky-Dill, Dan Braun, Jake Mendel, Kaarel Hänni, Avery Griffin, Jörn Stöhler, Magdalena Wache, Marius Hobbhahn

arXiv:2405.10928 · cs.LG · May 21, 2024

TL;DR

The paper introduces the Local Interaction Basis (LIB), an interpretability method that transforms a network's activations into a basis aligned with the singular vectors of the Jacobian between adjacent layers, dropping irrelevant directions so that the remaining features and their interactions are easier to identify.

Contribution

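LIB is a basis transformation that isolates computational features and their interactions. It targets a gap in existing interpretability techniques: activations lack a clean decomposition into features, because individual neurons and model components do not correspond cleanly to distinct features or functions.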

Findings

01. LIB identifies more computationally relevant features than PCA in the modular addition and CIFAR-10 models.

02. LIB reveals sparser interactions than PCA in the modular addition and CIFAR-10 models.

03. LIB does not significantly improve interpretability in language models.

Abstract

Mechanistic interpretability aims to understand the behavior of neural networks by reverse-engineering their internal computations. However, current methods struggle to find clear interpretations of neural network activations because a decomposition of activations into computational features is missing. Individual neurons or model components do not cleanly correspond to distinct features or functions. We present a novel interpretability method that aims to overcome this limitation by transforming the activations of the network into a new basis - the Local Interaction Basis (LIB). LIB aims to identify computational features by removing irrelevant activations and interactions. Our method drops irrelevant activation directions and aligns the basis with the singular vectors of the Jacobian matrix between adjacent layers. It also scales features based on their importance for downstream…
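
The two steps named in the abstract (dropping irrelevant activation directions, then aligning the basis with the Jacobian's singular vectors) can be illustrated with a minimal NumPy sketch. This is a simplified illustration under stated assumptions, not the authors' algorithm and not the API of the linked repository: the function names `drop_inactive_directions` and `jacobian_aligned_basis`, the relative tolerance `tol`, the toy `tanh` layer, and the choice to stack per-sample Jacobians before a single SVD are all hypothetical. The importance-based scaling mentioned in the abstract is omitted.

```python
import numpy as np


def drop_inactive_directions(acts, tol=1e-8):
    """Orthonormal basis (d, k) of directions in which the activations actually vary.

    Assumption: "irrelevant" is approximated here by a near-zero eigenvalue
    of the uncentered second-moment matrix of the activations.
    """
    gram = acts.T @ acts / acts.shape[0]        # (d, d)
    eigvals, eigvecs = np.linalg.eigh(gram)     # eigenvalues in ascending order
    keep = eigvals > tol * eigvals.max()
    return eigvecs[:, keep]


def jacobian_aligned_basis(acts, jacobians, tol=1e-8):
    """Rotate the kept directions onto right singular vectors of the Jacobian.

    acts:      (n, d)          activations at one layer
    jacobians: (n, d_next, d)  per-sample Jacobians of the next layer
    Returns (basis, singular_values) with basis of shape (d, k).
    """
    kept = drop_inactive_directions(acts, tol)       # (d, k)
    reduced = jacobians @ kept                       # (n, d_next, k)
    stacked = reduced.reshape(-1, kept.shape[1])     # (n * d_next, k)
    _, sing_vals, vt = np.linalg.svd(stacked, full_matrices=False)
    return kept @ vt.T, sing_vals


# Toy setup (hypothetical): 5-dim activations confined to a 3-dim subspace,
# followed by a layer x -> tanh(W x) with a random weight matrix W.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 5))   # (n, d) = (200, 5)
W = rng.normal(size=(4, 5))                                  # (d_next, d)
pre = acts @ W.T                                             # (n, d_next)
# Jacobian of tanh(W x) with respect to x is diag(1 - tanh(Wx)^2) @ W.
jacobians = (1.0 - np.tanh(pre) ** 2)[:, :, None] * W[None, :, :]

basis, sing_vals = jacobian_aligned_basis(acts, jacobians)
```

In this toy case the two directions the activations never occupy are dropped, so `basis` has three orthonormal columns, ordered by how strongly the next layer responds to each direction on average.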

Code & Models

Repositories

apolloresearch/rib (jax, Official)

Taxonomy

Topics: Neural Networks and Applications