Tackling Polysemanticity with Neuron Embeddings
Alex Foote

TL;DR
This paper introduces neuron embeddings as a novel method to identify and interpret polysemantic neurons in language models, enhancing understanding of neuron functions and aiding model evaluation.
Contribution
The paper presents a domain- and architecture-agnostic neuron embedding technique that reveals semantic behaviors of neurons, facilitating interpretation and evaluation of neural models.
Findings
Neuron embeddings effectively distinguish semantic behaviors in neurons.
The method is applied to GPT2-small, with a UI provided for exploring the results.
Neuron embeddings give a measure of neuron polysemanticity, which could improve evaluation of Sparse Auto-Encoders (SAEs).
Abstract
We present neuron embeddings, a representation that can be used to tackle polysemanticity by identifying the distinct semantic behaviours in a neuron's characteristic dataset examples, making downstream manual or automatic interpretation much easier. We apply our method to GPT2-small, and provide a UI for exploring the results. Neuron embeddings are computed using a model's internal representations and weights, making them domain and architecture agnostic and removing the risk of introducing external structure which may not reflect a model's actual computation. We describe how neuron embeddings can be used to measure neuron polysemanticity, which could be applied to better evaluate the efficacy of Sparse Auto-Encoders (SAEs).
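The abstract does not spell out how a neuron embedding is built, so the following is only a minimal sketch under stated assumptions, not the paper's implementation. It assumes an embedding can be formed by multiplying the hidden representation feeding a neuron element-wise with that neuron's input weights, that examples are grouped by cosine similarity with a greedy threshold rule, and that the number of groups serves as a crude polysemanticity score. The function names, the threshold value, and the toy data are all hypothetical.

```python
import numpy as np


def neuron_embeddings(hidden, w_in):
    """Assumed construction: weight each hidden dimension by the neuron's input weight.

    hidden: (n_examples, d_model) representations at the neuron's input.
    w_in:   (d_model,) input weights of a single neuron.
    """
    return hidden * w_in


def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def group_behaviours(embeddings, threshold=0.5):
    """Greedy grouping (an illustrative stand-in for a real clustering method).

    Each example joins the first group whose seed example it resembles,
    otherwise it starts a new group.
    """
    sims = cosine_similarity_matrix(embeddings)
    seeds = []
    labels = np.full(len(embeddings), -1)
    for i in range(len(embeddings)):
        for group, seed in enumerate(seeds):
            if sims[i, seed] >= threshold:
                labels[i] = group
                break
        else:
            seeds.append(i)
            labels[i] = len(seeds) - 1
    return labels


def polysemanticity_score(labels):
    """Number of distinct behaviour groups found for the neuron."""
    return len(set(labels.tolist()))


# Toy data: six activating examples that fall into two distinct directions.
hidden = np.array([
    [5.0, 0.1, 0.0, 0.0],
    [4.8, 0.0, 0.1, 0.0],
    [5.2, 0.1, 0.1, 0.0],
    [0.0, 0.1, 5.0, 0.0],
    [0.1, 0.0, 4.9, 0.0],
    [0.0, 0.0, 5.1, 0.1],
])
w_in = np.array([1.0, 0.5, 2.0, 1.0])

labels = group_behaviours(neuron_embeddings(hidden, w_in))
print(labels, polysemanticity_score(labels))
```

Under these assumptions, a neuron whose top-activating examples all land in one group would look monosemantic, while several well-separated groups would suggest distinct behaviours worth interpreting separately.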
Taxonomy
Topics: Neural Networks and Applications
