Sparse Autoencoders Enable Scalable and Reliable Circuit Identification in Language Models
Charles O'Neill, Thang Bui

TL;DR
This paper presents a discrete sparse autoencoder approach that efficiently identifies interpretable circuits in large language models, significantly reducing computational time and data requirements compared to prior methods.
Contribution
The authors introduce a novel autoencoder-based method that enables scalable, reliable circuit identification in language models using minimal examples and without architectural modifications.
Findings
Achieves higher precision and recall than state-of-the-art baselines.
Reduces runtime from hours to seconds.
Requires only 5-10 examples per task.
Abstract
This paper introduces an efficient and robust method for discovering interpretable circuits in large language models using discrete sparse autoencoders. Our approach addresses key limitations of existing techniques, namely computational complexity and sensitivity to hyperparameters. We propose training sparse autoencoders on carefully designed positive and negative examples, where the model can only correctly predict the next token for the positive examples. We hypothesise that learned representations of attention head outputs will signal when a head is engaged in specific computations. By discretising the learned representations into integer codes and measuring the overlap between codes unique to positive examples for each head, we enable direct identification of attention heads involved in circuits without the need for expensive ablations or architectural modifications. On three…
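The code-overlap idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes an already-trained sparse autoencoder has produced one latent activation vector per attention head per example, that discretisation means taking the index of the largest latent (an assumption), and that a head's score is the fraction of positive examples whose code never appears among the negative examples (also an assumption). The names `discretise`, `head_scores`, and `select_heads`, and the `threshold` parameter, are hypothetical.

```python
from typing import List, Sequence

# latents[example][head] is that head's SAE latent activation vector (assumed layout).
Latents = Sequence[Sequence[Sequence[float]]]


def discretise(latents: Latents) -> List[List[int]]:
    """Map each head's latent vector to the index of its largest entry.

    Assumption: argmax is used as the integer code; ties resolve to the lowest index.
    """
    return [
        [max(range(len(vec)), key=vec.__getitem__) for vec in example]
        for example in latents
    ]


def head_scores(pos_codes: List[List[int]], neg_codes: List[List[int]]) -> List[float]:
    """For each head, return the fraction of positive examples whose code
    is absent from every negative example (hypothetical scoring rule)."""
    n_heads = len(pos_codes[0])
    scores = []
    for h in range(n_heads):
        pos = [example[h] for example in pos_codes]
        neg = {example[h] for example in neg_codes}
        unique_to_pos = set(pos) - neg  # codes seen only on positive examples
        scores.append(sum(code in unique_to_pos for code in pos) / len(pos))
    return scores


def select_heads(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Return indices of heads whose score meets the (assumed) threshold."""
    return [h for h, s in enumerate(scores) if s >= threshold]


# Toy data: 2 positive and 2 negative examples, 2 heads, 3 latents per head.
pos_latents = [
    [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1]],
    [[0.2, 0.7, 0.1], [0.1, 0.8, 0.1]],
]
neg_latents = [
    [[0.9, 0.1, 0.0], [0.7, 0.2, 0.1]],
    [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]],
]

scores = head_scores(discretise(pos_latents), discretise(neg_latents))
candidate_heads = select_heads(scores)
```

In this toy data, head 0 uses a code on positive examples that never occurs on negative ones, so it is the only head selected; head 1 uses the same codes in both sets and scores zero. No ablation or model modification is involved, only set operations on the integer codes.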
Taxonomy
Topics: Natural Language Processing Techniques · Machine Learning in Materials Science · Machine Learning and Algorithms
