Toy Models of Superposition
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, Christopher Olah

TL;DR
This paper introduces a toy model in which polysemanticity can be fully understood: it arises when a network stores additional sparse features in "superposition." The model exhibits a phase change, a connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples.
Contribution
It provides a fully analyzable toy model in which polysemanticity arises from superposition, and it connects that model to a phase change, the geometry of uniform polytopes, and adversarial examples.
Findings
Polysemanticity arises from superposition of sparse features.
The toy model exhibits a phase change and a connection to the geometry of uniform polytopes.
There is evidence of a link between superposition and adversarial examples.
Abstract
Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
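To make "superposition" concrete, here is a minimal sketch in plain Python. It assumes the ReLU output form x' = ReLU(WᵀWx + b) with five features squeezed into two hidden dimensions and arranged as a regular pentagon. The helper names, the bias value of -0.35, and the hand-picked weights are illustrative assumptions rather than trained values or the authors' code.

```python
import math

def pentagon_embedding(n_features=5):
    """Hypothetical helper: one 2D unit vector per feature, evenly spaced on a circle.

    Each entry plays the role of one column of an (assumed) 2 x n weight matrix W.
    """
    return [
        (math.cos(2 * math.pi * k / n_features), math.sin(2 * math.pi * k / n_features))
        for k in range(n_features)
    ]

def toy_model(x, W, bias):
    """Compute x' = ReLU(W^T W x + b) for a feature vector x."""
    # Compress: project the n-dimensional input down to the 2D hidden space.
    h = [sum(W[i][d] * x[i] for i in range(len(x))) for d in range(2)]
    # Decompress: read each feature back out, add the bias, apply ReLU.
    return [max(0.0, W[j][0] * h[0] + W[j][1] * h[1] + bias) for j in range(len(W))]

W = pentagon_embedding()

# Gram matrix W^T W: the diagonal is 1, the off-diagonal entries are the
# interference between features that share the same two dimensions.
gram = [[W[i][0] * W[j][0] + W[i][1] * W[j][1] for j in range(5)] for i in range(5)]

# A sparse input: only feature 2 is active.
x = [0.0, 0.0, 1.0, 0.0, 0.0]
out = toy_model(x, W, bias=-0.35)  # bias chosen by hand for illustration

print([round(v, 3) for v in gram[2]])
print([round(v, 3) for v in out])
```

With a single active feature, the negative bias and the ReLU together zero out the interference terms, so only the active feature survives in the output. When several features are active at once the interference can add up, which is why the arrangement depends on the features being sparse.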
Taxonomy
Topics: Model Reduction and Neural Networks · Neural Networks and Applications · Explainable Artificial Intelligence (XAI)
