Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders
Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang, Xipeng Qiu

TL;DR
This paper introduces a scalable approach to extract millions of features from Llama-3.1-8B using Sparse Autoencoders, enabling detailed interpretability and analysis of language models.
Contribution
The authors train and evaluate 256 Sparse Autoencoders covering every layer and sublayer of Llama-3.1-8B-Base at 32K and 128K features, assess modifications to Top-K SAEs, and release open-source tools and checkpoints.
Findings
SAEs generalize to longer contexts and fine-tuned models
Feature splitting reveals new features in learned representations
Publicly available SAE checkpoints facilitate interpretability research
Abstract
Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models, yet scalable training remains a significant challenge. We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features. Modifications to a state-of-the-art SAE variant, Top-K SAEs, are evaluated across multiple dimensions. In particular, we assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models. Additionally, we analyze the geometry of learned SAE latents, confirming that feature splitting enables the discovery of new features. The Llama Scope SAE checkpoints are publicly available at https://huggingface.co/fnlp/Llama-Scope, alongside our scalable training, interpretation, and visualization tools at…
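For readers unfamiliar with the Top-K SAE variant the abstract refers to, the following is a minimal sketch of a generic Top-K sparse autoencoder forward pass. It is not the authors' modified variant or their training code. The function name `topk_sae_forward`, the weight names, the toy dimensions, the tied initialization, and the ReLU applied to the kept activations are all assumptions made for illustration.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Generic Top-K SAE forward pass for one activation vector x (illustrative only)."""
    # Encoder pre-activations, one value per dictionary feature.
    pre = W_enc @ (x - b_dec) + b_enc
    # Indices of the k largest pre-activations; every other latent is set to zero.
    top_idx = np.argpartition(pre, -k)[-k:]
    latents = np.zeros_like(pre)
    # Assumption: clamp the kept values at zero so latents are non-negative.
    latents[top_idx] = np.maximum(pre[top_idx], 0.0)
    # Linear decoder reconstructs the activation from the sparse latents.
    x_hat = W_dec @ latents + b_dec
    return latents, x_hat

# Toy sizes (hypothetical), far smaller than the 32K and 128K features in the abstract.
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4
W_enc = rng.normal(size=(n_features, d_model)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()  # tied initialization, an assumption for this sketch
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

x = rng.normal(size=d_model)
latents, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

In this sketch sparsity is enforced directly by the top-k selection instead of by a penalty term, so at most `k` latents are nonzero for any input.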
Taxonomy
Topics: Video Analysis and Summarization · Image Processing and 3D Reconstruction · Generative Adversarial Networks and Image Synthesis
Methods: LLaMA · Balanced Selection · Sparse Autoencoder
