Neuron to Graph: Interpreting Language Model Neurons at Scale
Alex Foote, Neel Nanda, Esben Kran, Ioannis Konstas, Shay Cohen, Fazl Barez

TL;DR
This paper presents Neuron to Graph (N2G), a scalable automated method for interpreting individual neurons in large language models by translating their behavior into visualizable graphs, enhancing interpretability and safety.
Contribution
Introduces N2G, a novel automated tool that scales neuron interpretability by converting neuron behaviors into graphs, enabling comprehensive analysis of large language models.
Findings
N2G predicts neuron activations more accurately than baseline methods.
The method scales to a 6-layer Transformer model using a single GPU.
Graphs facilitate manual and automated interpretation of neuron functions.
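One way to make the first finding concrete is to treat a neuron as "firing" on a token when its activation exceeds a threshold, then score a graph's predicted firing against the neuron's actual firing, token by token. The sketch below does this with precision, recall, and F1. The function name `firing_scores`, the 0.5 threshold, and the example numbers are assumptions for illustration; this summary does not state which metric or threshold the paper uses.

```python
def firing_scores(predicted, actual, threshold=0.5):
    """Precision, recall and F1 for per-token 'neuron fires' decisions.

    `predicted` and `actual` are equal-length lists of activations, one per token.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    # Binarise both sequences: does the neuron fire on this token?
    pred_fire = [p > threshold for p in predicted]
    true_fire = [a > threshold for a in actual]

    tp = sum(p and t for p, t in zip(pred_fire, true_fire))
    fp = sum(p and not t for p, t in zip(pred_fire, true_fire))
    fn = sum(t and not p for p, t in zip(pred_fire, true_fire))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical activations for a four-token prompt.
predicted = [0.9, 0.0, 0.8, 0.0]
actual = [1.0, 0.1, 0.2, 0.7]
print(firing_scores(predicted, actual))  # (0.5, 0.5, 0.5)
```

Scoring firing decisions rather than raw activation values keeps the comparison simple, at the cost of ignoring how strongly the neuron fires.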
Abstract
Advances in Large Language Models (LLMs) have led to remarkable capabilities, yet their inner mechanisms remain largely unknown. To understand these models, we need to unravel the functions of individual neurons and their contribution to the network. This paper introduces a novel automated approach designed to scale interpretability techniques across a vast array of neurons within LLMs, to make them more interpretable and ultimately safe. Conventional methods require examination of examples with strong neuron activation and manual identification of patterns to decipher the concepts a neuron responds to. We propose Neuron to Graph (N2G), an innovative tool that automatically extracts a neuron's behaviour from the dataset it was trained on and translates it into an interpretable graph. N2G uses truncation and saliency methods to emphasise only the most pertinent tokens to a neuron while…
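As a rough illustration of the idea in the abstract (and of the "token tree" or trie that the reviewers mention), the sketch below stores highly activating token contexts in a trie keyed from the activating token backwards, then looks up a predicted activation for the last token of a new prompt. The names `TrieNode`, `NeuronTrie`, `add_example`, and `predict`, the backward ordering, and the example tokens and activation values are all assumptions for illustration. This is not the authors' implementation, and it assumes the truncation and saliency steps have already reduced each context to its important tokens.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TrieNode:
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    # Activation recorded if a stored context ends at this node, else None.
    activation: Optional[float] = None


class NeuronTrie:
    """Hypothetical trie of activating contexts for a single neuron."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add_example(self, context: List[str], activation: float) -> None:
        """Store a pruned context whose last token is the activating token."""
        node = self.root
        # Walk from the activating token back through its context.
        for token in reversed(context):
            node = node.children.setdefault(token, TrieNode())
        # Keep the strongest activation seen for this exact context.
        if node.activation is None or activation > node.activation:
            node.activation = activation

    def predict(self, tokens: List[str]) -> float:
        """Predicted activation on the last token of `tokens` (0.0 if no match)."""
        node = self.root
        best = 0.0
        for token in reversed(tokens):
            node = node.children.get(token)
            if node is None:
                break  # the prompt diverges from every stored context
            if node.activation is not None:
                best = max(best, node.activation)
        return best


# Hypothetical examples: the neuron fires on "only" after particular contexts.
trie = NeuronTrie()
trie.add_example(["the", "only"], 0.9)
trie.add_example(["one", "and", "only"], 1.0)

print(trie.predict(["she", "was", "the", "only"]))  # 0.9
print(trie.predict(["not", "only"]))                # 0.0
```

Keying the trie from the activating token backwards means every stored context shares a root at the token the neuron fires on, which is also the natural centre of a visualised graph.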
Peer Reviews
Decision·Submitted to ICLR 2024
1. This paper proposes an automated method to interpret the behavior of neurons in language models by constructing a token tree. The visualization of the token tree facilitates the interpretability of neurons and the identification of neurons of interest. 2. The method can be easily scaled to large language models. 3. In the experiments, the effectiveness of the method is validated by comparing it with two other methods. In addition, the writing and presentation of this paper are good.
1. The paper mentions that the interpretability of neurons in deep layers is poor, but it does not provide any examples of poorly explained neurons. 2. This work is similar to [1], as both explore the behavior of neurons based on their activations to different tokens. This paper automates the interpretation of neurons using a graph, but does not significantly improve the interpretability of language model neurons. The differences with [1] need to be further explained in detail. 3. Current popula
1. A timely topic: explaining large language models. 2. The proposed method can provide some useful information about LLM neurons. The pruning step and augmentation step make sense to me. 3. The method is intuitive and easy to follow.
1. Only using SoLU is a problem. We want to explain the models that are heavily used in practice, not just testbeds. 2. The whole process is pretty ad hoc. There is no rigorous definition of explanation. I would say this work is more like a post-hoc analysis than research. 3. I can hardly call the result a "graph", since it is just a pivot node with some context nodes. It IS a graph, but a very limited one.
A. The paper addresses an important problem: interpreting neurons in large language models, which can have implications for mechanistic interpretability, bias detection, and model safety. B. The paper also provides a way to measure the quality of the generated neuron graphs by comparing them to the ground truth activations of neurons. C. The paper is well-written and organized, with clear motivation, methodology, and experiments. It includes several figures and tables that illustrate the prop
A. No external validation of the method is done to show out of distribution generalization. Ground truth activation prediction is performed on the same model, and the same dataset that was used to create the Trie of highly activating substrings. This is a major limitation. B. The authors acknowledge in the introduction that one problem in looking at highly activating samples in a dataset is that it provides an illusion of interpretability, but they do not address this problem. They augment the
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning in Materials Science · Adversarial Robustness in Machine Learning
Methods: Multi-Head Attention · Attention Is All You Need · Dropout · Residual Connection · Linear Layer · Layer Normalization · Byte Pair Encoding · Softmax · Label Smoothing · Absolute Position Encodings
