Sparse Autoencoders Find Highly Interpretable Features in Language Models

Hoagy Cunningham; Aidan Ewart; Logan Riggs; Robert Huben; Lee Sharkey

arXiv:2309.08600 · cs.LG · October 5, 2023 · 41 cites

TL;DR

This paper introduces a scalable, unsupervised method using sparse autoencoders to identify interpretable, monosemantic features in language models, helping to resolve superposition and improve model interpretability.

Contribution

The authors demonstrate that sparse autoencoders can find more interpretable features in language models than previous methods, aiding in understanding neural network internals.

Findings

01. Sparse autoencoders learn interpretable, monosemantic features.

02. Features identified are causally linked to model behavior.

03. Method outperforms alternative approaches in interpretability.

Abstract

One of the roadblocks to a better understanding of neural networks' internals is polysemanticity, where neurons appear to activate in multiple, semantically distinct contexts. Polysemanticity prevents us from identifying concise, human-understandable explanations for what neural networks are doing internally. One hypothesised cause of polysemanticity is superposition, where neural networks represent more features than they have neurons by assigning features to an overcomplete set of directions in activation space, rather than to individual neurons. Here, we attempt to identify those directions, using sparse autoencoders to reconstruct the internal activations of a language model. These autoencoders learn sets of sparsely activating features that are more interpretable and monosemantic than directions identified by alternative approaches, where interpretability is…
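As a rough illustration of the idea in the abstract, the sketch below computes the forward pass and training objective of a small sparse autoencoder on a single activation vector. It is not the authors' implementation. The abstract does not specify the architecture, so the ReLU encoder, the decoder tied to the encoder weights, the L1 sparsity penalty, and every name used here (`encode`, `decode`, `sae_loss`, `alpha`) are assumptions chosen for illustration.

```python
import math
import random


def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def encode(M, b, x):
    """Feature activations c = ReLU(M x + b).

    M has one row per dictionary feature. Assumption: there are more rows
    than activation dimensions, i.e. the dictionary is overcomplete.
    """
    return relu([sum(m * xj for m, xj in zip(row, x)) + bi
                 for row, bi in zip(M, b)])


def decode(M, c):
    """Reconstruction x_hat = M^T c (assumption: decoder tied to encoder)."""
    d = len(M[0])
    return [sum(c[i] * M[i][j] for i in range(len(M))) for j in range(d)]


def sae_loss(M, b, x, alpha):
    """Squared reconstruction error plus an L1 penalty on the activations.

    alpha is a hypothetical sparsity coefficient: larger values push more
    feature activations to zero at the cost of reconstruction quality.
    """
    c = encode(M, b, x)
    x_hat = decode(M, c)
    recon = sum((a - r) ** 2 for a, r in zip(x, x_hat))
    sparsity = alpha * sum(abs(a) for a in c)
    return recon + sparsity, c, x_hat


if __name__ == "__main__":
    random.seed(0)
    d, n_features = 4, 8  # 8 candidate feature directions in a 4-dim space

    # Random dictionary with unit-norm rows, standing in for learned features.
    M = []
    for _ in range(n_features):
        row = [random.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(r * r for r in row))
        M.append([r / norm for r in row])
    b = [0.0] * n_features

    # Stand-in for one internal activation vector of a language model.
    x = [random.gauss(0.0, 1.0) for _ in range(d)]

    loss, c, x_hat = sae_loss(M, b, x, alpha=0.1)
    active = sum(1 for a in c if a > 0.0)
    print(f"loss={loss:.4f}, active features={active}/{n_features}")
```

In practice `M` and `b` would be trained by gradient descent on this loss over many activation vectors. The sketch covers only the objective, which trades reconstruction accuracy against the number of active features.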

Code & Models

Videos

Sparse Autoencoders Find Highly Interpretable Features in Language Models · slideslive

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Adversarial Robustness in Machine Learning