Flash Interpretability: Decoding Specialised Feature Neurons in Large Language Models with the LM-Head

Harry J Davies

arXiv:2501.02688 · cs.CL · March 3, 2025

TL;DR

This paper introduces a method for interpreting large language models by decoding neuron weights into token probabilities via the LM-head, which quickly identifies specialised neurons such as a 'dog' or 'California' neuron and shows their influence on outputs.

Contribution

The author presents a technique that decodes neuron weights directly into token probabilities using the LM-head, allowing rapid interpretation of large language models.
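The decoding step can be illustrated with a minimal sketch. It assumes that each neuron's weight vector has the same dimension as the residual stream, and that decoding means multiplying it by the LM-head matrix and applying a softmax; the paper's exact procedure (for example, any normalisation) may differ. The vocabulary, sizes, and weights below are toy stand-ins, and `decode_neurons` is a hypothetical helper name, not something from the paper.

```python
import numpy as np

def decode_neurons(neuron_weights, lm_head):
    """Map each neuron's weight vector to a distribution over tokens.

    neuron_weights: (n_neurons, d_model), one row per neuron.
    lm_head:        (vocab_size, d_model), the final projection matrix.
    """
    logits = neuron_weights @ lm_head.T                    # (n_neurons, vocab_size)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

# Toy stand-ins (assumption: these are NOT Llama weights).
vocab = ["dog", "cat", "California", "the"]
d_model = 8
rng = np.random.default_rng(0)
lm_head = rng.normal(size=(len(vocab), d_model))
lm_head /= np.linalg.norm(lm_head, axis=1, keepdims=True)  # unit-norm rows

# Two hand-built neurons aligned with the "dog" and "California" rows.
neurons = np.stack([3.0 * lm_head[0], 3.0 * lm_head[2]])

probs = decode_neurons(neurons, lm_head)
top_tokens = [vocab[i] for i in probs.argmax(axis=-1)]
print(top_tokens)
```

Because the toy neurons are built by hand to point along specific LM-head rows, the sketch only shows the mechanics of the projection; it says nothing about how often real neurons are this cleanly aligned.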

Findings

01

Over 75% of neurons in the instruct model's up-projection layers have the same top associated token as in the pre-trained model.

02

Clamping a specific neuron, such as the 'dog' neuron, leads the instruct model to consistently discuss the related concept.

03

Decoding the features of all neurons in Llama 3.1 8B takes less than 10 seconds with minimal compute.
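The clamping finding can be sketched on a toy model. Everything here is an assumption made for illustration: the MLP is a simplified ungated block with a ReLU (Llama uses a gated SiLU MLP), the weights are random, and neuron 0 is wired by hand so that its output direction matches the "dog" row of the LM-head. The sketch shows only that fixing one activation to a large value shifts probability toward the associated token.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mlp_forward(x, w_up, w_down, clamp=None):
    """Simplified MLP block with a residual connection.

    clamp: optional (neuron_index, value) pair that overrides one activation.
    """
    h = np.maximum(w_up @ x, 0.0)   # ReLU stand-in for the real activation
    if clamp is not None:
        idx, value = clamp
        h[idx] = value              # force the chosen neuron to a fixed value
    return x + w_down @ h

# Toy stand-ins (assumption: these are NOT Llama weights).
vocab = ["dog", "cat", "California", "the"]
d_model, d_ff = 8, 16
rng = np.random.default_rng(0)
lm_head = rng.normal(size=(len(vocab), d_model))
lm_head /= np.linalg.norm(lm_head, axis=1, keepdims=True)
w_up = 0.1 * rng.normal(size=(d_ff, d_model))
w_down = 0.1 * rng.normal(size=(d_model, d_ff))
w_down[:, 0] = lm_head[0]           # hypothetical "dog" neuron, wired by hand

x = rng.normal(size=d_model)        # stand-in for a residual-stream vector
p_base = softmax(lm_head @ mlp_forward(x, w_up, w_down))
p_clamped = softmax(lm_head @ mlp_forward(x, w_up, w_down, clamp=(0, 10.0)))
print(p_base[0], p_clamped[0])      # probability of "dog" before and after
```

In a real model the same idea would typically be applied inside the forward pass of a chosen layer, and the choice of which activation to clamp and to what value is left open here.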

Abstract

Large Language Models (LLMs) typically have billions of parameters and are thus often difficult to interpret in their operation. In this work, we demonstrate that it is possible to decode neuron weights directly into token probabilities through the final projection layer of the model (the LM-head). This is illustrated in Llama 3.1 8B, where we use the LM-head to find examples of specialised feature neurons such as a "dog" neuron and a "California" neuron, and we validate this by clamping these neurons to affect the probability of the concept in the output. We evaluate this method on both the pre-trained and instruct models, finding that over 75% of neurons in the up-projection layers of the instruct model have the same top associated token as in the pre-trained model. Finally, we demonstrate that clamping the "dog" neuron leads the instruct model to always discuss dogs when asked…
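The comparison between the pre-trained and instruct models mentioned in the abstract can be sketched as a top-token agreement rate. This is a rough illustration under stated assumptions: each model's neurons are scored against that model's own LM-head (the paper may do this differently), the "instruct" weights are simulated as a small random perturbation of the "pre-trained" ones, and the function names are hypothetical. The number it prints is a property of the toy data only.

```python
import numpy as np

def top_token_ids(neuron_weights, lm_head):
    """Index of the highest-scoring token per neuron (softmax preserves argmax)."""
    return (neuron_weights @ lm_head.T).argmax(axis=-1)

def top_token_agreement(w_pre, w_inst, head_pre, head_inst):
    """Fraction of neurons whose top token is the same in both models."""
    same = top_token_ids(w_pre, head_pre) == top_token_ids(w_inst, head_inst)
    return float(np.mean(same))

# Toy stand-ins (assumption: these are NOT Llama weights).
rng = np.random.default_rng(0)
vocab_size, d_model, n_neurons = 50, 16, 200
head_pre = rng.normal(size=(vocab_size, d_model))
w_pre = rng.normal(size=(n_neurons, d_model))

# Simulated "instruct" model: the pre-trained weights plus a small perturbation.
head_inst = head_pre + 0.05 * rng.normal(size=head_pre.shape)
w_inst = w_pre + 0.05 * rng.normal(size=w_pre.shape)

agreement = top_token_agreement(w_pre, w_inst, head_pre, head_inst)
self_agreement = top_token_agreement(w_pre, w_pre, head_pre, head_pre)
print(f"agreement: {agreement:.2f}, self-agreement: {self_agreement:.2f}")
```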

Taxonomy

Topics: Neural Networks and Applications

Methods: Linear Layer · LLaMA