Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

Lucius Bushnaq; Jake Mendel; Stefan Heimersheim; Dan Braun; Nicholas Goldowsky-Dill; Kaarel Hänni; Cindy Wu; Marius Hobbhahn

arXiv:2405.10927 · cs.LG · May 21, 2024

Open Access

TL;DR

This paper examines how degeneracy in neural network parameters, arising from linear dependence between activations or gradients in a layer and from ReLUs that fire on the same datapoints, can be exploited for mechanistic interpretability by building representations that are invariant to it, such as the Interaction Basis.

Contribution

It identifies three ways in which network parameters can be degenerate, gives a heuristic argument that modular networks are likely to be more degenerate, and proposes the Interaction Basis as an invariant representation intended to aid interpretability.

Findings

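01 Degeneracy can arise from linear dependence between activations in a layer, or between gradients passed back to a layer.

02 A heuristic argument suggests that modular networks are likely to be more degenerate.

03 The Interaction Basis provides a sparsified, invariant representation of the network.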

Abstract

Mechanistic Interpretability aims to reverse engineer the algorithms implemented by neural networks by studying their weights and activations. An obstacle to reverse engineering neural networks is that many of the parameters inside a network are not involved in the computation being implemented by the network. These degenerate parameters may obfuscate internal structure. Singular learning theory teaches us that neural network parameterizations are biased towards being more degenerate, and parameterizations with more degeneracy are likely to generalize further. We identify 3 ways that network parameters can be degenerate: linear dependence between activations in a layer; linear dependence between gradients passed back to a layer; ReLUs which fire on the same subset of datapoints. We also present a heuristic argument that modular networks are likely to be more degenerate, and we develop a…
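Two of the degeneracy sources named in the abstract can be illustrated on a toy layer. The sketch below is not the paper's method: the function names `activation_rank` and `duplicate_relu_patterns`, the tolerance, and the random toy data are all assumptions made for illustration. It estimates linear dependence between activations through the numerical rank of the activation matrix, and groups neurons whose ReLUs fire on exactly the same datapoints.

```python
import numpy as np

def activation_rank(acts, tol=1e-8):
    """Numerical rank of an (n_datapoints, n_neurons) activation matrix.

    A rank below n_neurons means some activations are linear
    combinations of others on this dataset.
    """
    s = np.linalg.svd(acts, compute_uv=False)
    if s.size == 0 or s[0] == 0:
        return 0
    # Count singular values that are non-negligible relative to the largest.
    return int(np.sum(s > tol * s[0]))

def duplicate_relu_patterns(preacts):
    """Groups of neuron indices whose ReLUs fire on the same datapoints."""
    fires = preacts > 0  # boolean firing pattern, (n_datapoints, n_neurons)
    groups = {}
    for j in range(fires.shape[1]):
        groups.setdefault(fires[:, j].tobytes(), []).append(j)
    return [g for g in groups.values() if len(g) > 1]

# Hypothetical toy layer: 4 inputs, 6 neurons, 200 datapoints.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 4))
w = rng.normal(size=(4, 6))
w[:, 5] = 2.0 * w[:, 0]  # neuron 5 is a positive multiple of neuron 0
preacts = x @ w
acts = np.maximum(preacts, 0.0)

print("pre-activation rank:", activation_rank(preacts))
print("post-ReLU rank:", activation_rank(acts))
print("neurons with identical firing patterns:", duplicate_relu_patterns(preacts))
```

In this toy setup the pre-activations cannot have rank above 4, because they are linear functions of 4 inputs, and neurons 0 and 5 share a firing pattern by construction. On a real network the same checks would be run on activations collected over the training distribution, and the tolerance would need tuning.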

Taxonomy

Topics: Natural Language Processing Techniques · Statistical and Computational Modeling