Toy Models of Superposition

Nelson Elhage; Tristan Hume; Catherine Olsson; Nicholas Schiefer; Tom Henighan; Shauna Kravec; Zac Hatfield-Dodds; Robert Lasenby; Dawn Drain; Carol Chen; Roger Grosse; Sam McCandlish; Jared Kaplan; Dario Amodei; Martin Wattenberg; Christopher Olah

arXiv:2209.10652 · cs.LG · September 23, 2022 · 42 cites

TL;DR

This paper introduces a toy model in which polysemanticity can be fully understood: it arises when a network stores additional sparse features in superposition. The model exhibits a phase change and a connection to the geometry of uniform polytopes, and the paper reports evidence of a link to adversarial examples.

Contribution

The paper provides a toy model in which polysemanticity can be fully understood as a result of superposition, and it relates that model to a phase change, the geometry of uniform polytopes, and adversarial examples.

Findings

01. Polysemanticity arises when a model stores additional sparse features in superposition.

02. The toy model exhibits a phase change and a connection to the geometry of uniform polytopes.

03. There is evidence of a link between superposition and adversarial examples.

Abstract

Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
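
The abstract does not spell out the toy model, so the following is only a minimal sketch of the idea rather than the authors' code. It assumes the form used in the full paper, a reconstruction x' = ReLU(WᵀWx + b) that squeezes n features through m < n hidden dimensions. The pentagon-shaped weights, the bias value, and the function names `toy_forward` and `pentagon_weights` are illustrative assumptions chosen by hand, not trained values.

```python
import numpy as np

def toy_forward(W, b, x):
    """Compute ReLU(W^T W x + b): compress n features to m dims, then expand back."""
    h = W @ x                             # hidden activations, shape (m,)
    return np.maximum(W.T @ h + b, 0.0)   # reconstruction, shape (n,)

def pentagon_weights(n=5):
    """Hypothetical weights: n unit vectors evenly spaced on a circle (m = 2)."""
    angles = 2 * np.pi * np.arange(n) / n
    return np.stack([np.cos(angles), np.sin(angles)])  # shape (2, n)

W = pentagon_weights()

# Gram matrix W^T W: the diagonal is 1, the off-diagonal entries are the
# interference between features that share the same two hidden dimensions.
gram = W.T @ W

# A sparse input with only feature 0 active.
x = np.zeros(5)
x[0] = 1.0

# Without the nonlinearity, the other features pick up interference.
linear_out = gram @ x

# Assumed bias: just large enough to cancel the positive interference from
# the two neighbouring directions, so the ReLU zeroes every inactive feature.
b = -np.cos(2 * np.pi / 5) * np.ones(5)
relu_out = toy_forward(W, b, x)

print("linear readout:", np.round(linear_out, 3))
print("ReLU readout:  ", np.round(relu_out, 3))
```

In this sketch five features share two dimensions. For a one-hot input, the ReLU and negative bias suppress the interference on the inactive features, at the cost of shrinking the active feature's reconstructed value. When several features are active at once the interference no longer cancels, which is why sparsity matters for superposition.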

Code & Models

Repositories

anthropics/toy-models-of-superposition (Official)

Taxonomy

Topics: Model Reduction and Neural Networks · Neural Networks and Applications · Explainable Artificial Intelligence (XAI)