Tackling Polysemanticity with Neuron Embeddings

Alex Foote

arXiv:2411.08166 · cs.LG · November 14, 2024

TL;DR

This paper introduces neuron embeddings, a representation that separates the distinct semantic behaviours of a polysemantic neuron in a language model, making each behaviour easier to interpret and giving a way to measure how polysemantic a neuron is.

Contribution

The paper presents a domain- and architecture-agnostic neuron embedding technique that reveals semantic behaviors of neurons, facilitating interpretation and evaluation of neural models.

Findings

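01. Neuron embeddings identify the distinct semantic behaviours in a neuron's characteristic dataset examples.

02. The method is applied to GPT2-small, with a UI for exploring the results.

03. Neuron embeddings can be used to measure neuron polysemanticity, which could help evaluate Sparse Auto-Encoders (SAEs).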

Abstract

We present neuron embeddings, a representation that can be used to tackle polysemanticity by identifying the distinct semantic behaviours in a neuron's characteristic dataset examples, making downstream manual or automatic interpretation much easier. We apply our method to GPT2-small, and provide a UI for exploring the results. Neuron embeddings are computed using a model's internal representations and weights, making them domain and architecture agnostic and removing the risk of introducing external structure which may not reflect a model's actual computation. We describe how neuron embeddings can be used to measure neuron polysemanticity, which could be applied to better evaluate the efficacy of Sparse Auto-Encoders (SAEs).
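
The abstract does not spell out how a neuron embedding is computed, so the following is only a minimal sketch under stated assumptions, not the paper's implementation. It assumes an embedding is the element-wise product of a neuron's input weights with the layer input at an activating token, and that distinct behaviours can be separated by grouping embeddings whose cosine similarity exceeds a threshold. The function names, the threshold of 0.5, and the use of cluster count as a polysemanticity score are all hypothetical choices made for illustration.

```python
import numpy as np


def neuron_embeddings(w_in, layer_inputs):
    """Assumed definition: element-wise product of the neuron's input
    weights (d_model,) with each layer input (n_examples, d_model)."""
    return layer_inputs * w_in


def cosine_similarity_matrix(emb):
    """Pairwise cosine similarity between rows of emb."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    return unit @ unit.T


def cluster_by_threshold(sim, threshold=0.5):
    """Hypothetical clustering step: label the connected components of
    the graph that links examples with similarity >= threshold."""
    n = sim.shape[0]
    labels = -np.ones(n, dtype=int)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in np.flatnonzero(sim[i] >= threshold):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def polysemanticity_score(labels):
    """Hypothetical score: number of distinct behaviour clusters."""
    return len(set(labels.tolist()))


if __name__ == "__main__":
    # Toy data (made up): four top-activating examples for one neuron,
    # two driven by the first input dimensions and two by the last ones.
    w_in = np.ones(4)
    layer_inputs = np.array([
        [1.0, 0.0, 0.0, 0.0],
        [0.9, 0.1, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.9, 0.1],
    ])
    emb = neuron_embeddings(w_in, layer_inputs)
    labels = cluster_by_threshold(cosine_similarity_matrix(emb))
    print(labels, polysemanticity_score(labels))
```

Because everything here is derived from the model's own weights and activations, the sketch uses no external embedding model, which matches the abstract's point about not introducing outside structure. A score of 1 would suggest a monosemantic neuron under these assumptions, and higher scores would suggest several distinct behaviours.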

Taxonomy

Topics: Neural Networks and Applications