Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models

Charles O'Neill; Thang Bui

arXiv:2405.12522 · cs.CL · May 22, 2024

TL;DR

This paper presents a discrete sparse autoencoder approach that identifies interpretable circuits in large language models, cutting runtime from hours to seconds and needing only 5-10 examples per task compared to prior methods.

Contribution

The authors introduce a novel autoencoder-based method that enables scalable, reliable circuit identification in language models using minimal examples and without architectural modifications.

Findings

01. Achieves higher precision and recall than state-of-the-art baselines.
02. Reduces runtime from hours to seconds.
03. Requires only 5-10 examples per task.

Abstract

This paper introduces an efficient and robust method for discovering interpretable circuits in large language models using discrete sparse autoencoders. Our approach addresses key limitations of existing techniques, namely computational complexity and sensitivity to hyperparameters. We propose training sparse autoencoders on carefully designed positive and negative examples, where the model can only correctly predict the next token for the positive examples. We hypothesise that learned representations of attention head outputs will signal when a head is engaged in specific computations. By discretising the learned representations into integer codes and measuring the overlap between codes unique to positive examples for each head, we enable direct identification of attention heads involved in circuits without the need for expensive ablations or architectural modifications. On three…
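
The discretisation and overlap step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each attention head already has one latent vector per example from a trained sparse autoencoder, it uses the index of the largest latent entry as the integer code, and it scores a head by the fraction of positive examples whose code never appears among the negative examples. The names `discretise`, `positive_only_score`, `rank_heads`, the `threshold` parameter, and the toy data are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Head = Tuple[int, int]  # (layer index, head index); hypothetical labelling


def discretise(latent: Sequence[float]) -> int:
    """Map a latent vector to an integer code.

    Assumption: the code is the index of the largest entry (argmax).
    """
    return max(range(len(latent)), key=lambda i: latent[i])


def positive_only_score(
    pos_latents: List[Sequence[float]],
    neg_latents: List[Sequence[float]],
) -> float:
    """Fraction of positive examples whose code is absent from all negatives."""
    pos_codes = [discretise(z) for z in pos_latents]
    neg_codes = {discretise(z) for z in neg_latents}
    if not pos_codes:
        return 0.0
    hits = sum(1 for code in pos_codes if code not in neg_codes)
    return hits / len(pos_codes)


def rank_heads(
    pos: Dict[Head, List[Sequence[float]]],
    neg: Dict[Head, List[Sequence[float]]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[Head]:
    """Return heads whose score meets the threshold, highest score first."""
    scores = {head: positive_only_score(pos[head], neg[head]) for head in pos}
    selected = [head for head, score in scores.items() if score >= threshold]
    return sorted(selected, key=lambda head: -scores[head])


# Toy data: two heads, two positive and two negative examples each.
pos = {
    (0, 0): [[0.1, 0.9, 0.0], [0.2, 0.7, 0.1]],  # codes 1, 1
    (0, 1): [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]],  # codes 0, 2
}
neg = {
    (0, 0): [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]],  # codes 0, 2
    (0, 1): [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]],  # codes 0, 2
}

circuit_heads = rank_heads(pos, neg)
print(circuit_heads)
```

In the toy data, head (0, 0) uses a code on positive examples that it never uses on negative ones, so it is the only head selected. No ablation is run at any point, which is the property the abstract highlights.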

Taxonomy

Topics: Natural Language Processing Techniques · Machine Learning in Materials Science · Machine Learning and Algorithms