From Neurons to Neutrons: A Case Study in Interpretability

Ouail Kitouni; Niklas Nolte; Víctor Samuel Pérez-Díaz; Sokratis Trifinopoulos; Mike Williams

arXiv:2405.17425·cs.LG·May 28, 2024

TL;DR

This paper explores how mechanistic interpretability can reveal low-dimensional, human-understandable representations in neural networks, demonstrated through models trained on nuclear physics data, offering new insights into model behavior.

Contribution

The paper shows that neural networks learn low-dimensional representations that align with domain knowledge, making them useful for more than simply producing good predictions.

Findings

1. Neural networks develop low-dimensional, interpretable representations.
2. These representations can be aligned with human domain knowledge.
3. Interpretability techniques can yield new scientific insights.

Abstract

Mechanistic Interpretability (MI) promises a path toward fully understanding how neural networks make their predictions. Prior work demonstrates that even when trained to perform simple arithmetic, models can implement a variety of algorithms (sometimes concurrently) depending on initialization and hyperparameters. Does this mean neuron-level interpretability techniques have limited applicability? We argue that high-dimensional neural networks can learn low-dimensional representations of their training data that are useful beyond simply making good predictions. Such representations can be understood through the mechanistic interpretability lens and provide insights that are surprisingly faithful to human-derived domain knowledge. This indicates that such approaches to interpretability can be useful for deriving a new understanding of a problem from models trained to solve it. As a case…

Code & Models

Repositories

samuelperezdi/nuclr-icml (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques