Universal Neurons in GPT2 Language Models

Wes Gurnee; Theo Horsley; Zifan Carl Guo; Tara Rezaei Kheirkhah; Qinyi Sun; Will Hathaway; Neel Nanda; Dimitris Bertsimas

arXiv:2401.12181 · cs.LG · January 23, 2024 · 2 cites

TL;DR

This paper investigates whether individual neurons in GPT2 models trained from different random initializations are universal, that is, whether they consistently activate on the same inputs across models. It finds that a small subset of neurons (1-5%) are universal and that these neurons have interpretable, functional roles.

Contribution

The study demonstrates that universal neurons exist across GPT2 models trained from different random seeds, and it characterizes their interpretations and functional roles.

Findings

01 1-5% of neurons are universal across models

02 Universal neurons are interpretable and form a small taxonomy

03 Universal neurons influence attention, entropy, and token prediction

Abstract

A basic question within the emerging field of mechanistic interpretability is the degree to which neural networks learn the same underlying mechanisms. In other words, are neural mechanisms universal across different models? In this work, we study the universality of individual neurons across GPT2 models trained from different initial random seeds, motivated by the hypothesis that universal neurons are likely to be interpretable. In particular, we compute pairwise correlations of neuron activations over 100 million tokens for every neuron pair across five different seeds and find that 1-5% of neurons are universal, that is, pairs of neurons which consistently activate on the same inputs. We then study these universal neurons in detail, finding that they usually have clear interpretations and taxonomize them into a small number of neuron families. We conclude by studying patterns in…
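
The correlation step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the 0.5 threshold, the rule of averaging each neuron's best match over the other models, and the toy activations are all assumptions made for illustration. At the scale of 100 million tokens the statistics would presumably have to be accumulated in a streaming fashion rather than held in memory as they are here.

```python
import numpy as np


def pairwise_correlation(acts_a, acts_b):
    """Pearson correlation between every neuron of model A and every neuron of model B.

    acts_a: array of shape (n_tokens, n_neurons_a)
    acts_b: array of shape (n_tokens, n_neurons_b)
    Returns an array of shape (n_neurons_a, n_neurons_b).
    """
    # Center each neuron's activations over the token axis.
    a = acts_a - acts_a.mean(axis=0)
    b = acts_b - acts_b.mean(axis=0)
    # Scale each column to unit length; the epsilon guards against dead neurons.
    a = a / (np.linalg.norm(a, axis=0) + 1e-12)
    b = b / (np.linalg.norm(b, axis=0) + 1e-12)
    return a.T @ b


def universal_mask(acts_ref, other_models, threshold=0.5):
    """Flag reference-model neurons that have a strongly correlated partner elsewhere.

    Assumed criterion (hypothetical, for illustration only): a neuron counts as
    universal when its best correlation, averaged over the other models,
    exceeds `threshold`.
    """
    best = [pairwise_correlation(acts_ref, acts).max(axis=1) for acts in other_models]
    return np.mean(best, axis=0) > threshold


# Toy data: two "models" with three neurons each, evaluated on 1000 tokens.
rng = np.random.default_rng(0)
shared = rng.normal(size=1000)  # a feature that both models happen to represent

acts_a = rng.normal(size=(1000, 3))
acts_a[:, 0] = shared + 0.1 * rng.normal(size=1000)

acts_b = rng.normal(size=(1000, 3))
# Pearson correlation is invariant to positive scaling and shifting.
acts_b[:, 2] = 2.0 * shared + 1.0 + 0.1 * rng.normal(size=1000)

corr = pairwise_correlation(acts_a, acts_b)
mask = universal_mask(acts_a, [acts_b])
print(mask)
```

In this toy setup only neuron 0 of the first model has a counterpart in the second (its neuron 2), so it is the only neuron flagged. The matching neuron sits at a different index and uses a different scale and offset, which is why the comparison is made over all neuron pairs with a scale-invariant measure rather than index by index.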

Code & Models

Repositories

wesg52/universal-neurons
jax · Official

Models

🤗 google/gemma-scope-2b-pt-transcoders
model · ♡ 13

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Topic Modeling