From Neurons to Neutrons: A Case Study in Interpretability
Ouail Kitouni, Niklas Nolte, Víctor Samuel Pérez-Díaz, Sokratis Trifinopoulos, Mike Williams

TL;DR
This paper shows how mechanistic interpretability can reveal low-dimensional, human-understandable representations in neural networks, using models trained on nuclear physics data as a case study, and argues that these representations are useful beyond the models' predictions.
Contribution
The paper shows that neural networks learn low-dimensional representations that align with human-derived domain knowledge and are useful beyond simply making good predictions.
Findings
Neural networks develop low-dimensional, interpretable representations.
These representations can be aligned with human domain knowledge.
Interpretability techniques can yield new scientific insights.
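One common way to check whether a learned representation is low-dimensional is to look at how much variance its leading principal components capture. The sketch below is illustrative only and is not claimed to be the authors' method: the `explained_variance_ratio` helper is a hypothetical name, and the embedding matrix is synthetic (points near a 2-D plane in 64 dimensions) rather than taken from any trained model.

```python
import numpy as np


def explained_variance_ratio(embeddings: np.ndarray) -> np.ndarray:
    """Fraction of total variance captured by each principal component."""
    # Center each feature so the SVD spectrum corresponds to PCA.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    # Squared singular values of the centered matrix are proportional
    # to the variance along each principal direction.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


# Synthetic stand-in for learned embeddings (an assumption, not paper data):
# 200 points lying near a 2-D plane embedded in 64 dimensions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
projection = rng.normal(size=(2, 64))
embeddings = latent @ projection + 0.01 * rng.normal(size=(200, 64))

ratio = explained_variance_ratio(embeddings)
top2 = float(ratio[:2].sum())
print(f"variance explained by the top 2 components: {top2:.4f}")
```

If nearly all of the variance sits in a handful of components, the representation is effectively low-dimensional, and those components are natural candidates to compare against known domain quantities.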
Abstract
Mechanistic Interpretability (MI) promises a path toward fully understanding how neural networks make their predictions. Prior work demonstrates that even when trained to perform simple arithmetic, models can implement a variety of algorithms (sometimes concurrently) depending on initialization and hyperparameters. Does this mean neuron-level interpretability techniques have limited applicability? We argue that high-dimensional neural networks can learn low-dimensional representations of their training data that are useful beyond simply making good predictions. Such representations can be understood through the mechanistic interpretability lens and provide insights that are surprisingly faithful to human-derived domain knowledge. This indicates that such approaches to interpretability can be useful for deriving a new understanding of a problem from models trained to solve it. As a case…
Taxonomy
Topics: Natural Language Processing Techniques
