Llama Scope: Extracting Millions of Features from Llama-3.1-8B with Sparse Autoencoders

Zhengfu He; Wentao Shu; Xuyang Ge; Lingjie Chen; Junxuan Wang; Yunhua Zhou; Frances Liu; Qipeng Guo; Xuanjing Huang; Zuxuan Wu; Yu-Gang Jiang; Xipeng Qiu

arXiv:2410.20526 · cs.LG · October 29, 2024 · 3 cites

TL;DR

This paper trains a suite of Sparse Autoencoders on Llama-3.1-8B at scale, extracting millions of features to support interpretability research on the model.

Contribution

The authors train 256 Sparse Autoencoders covering every layer and sublayer of Llama-3.1-8B-Base, at 32K and 128K features, evaluate modifications to the Top-K SAE variant, and release the checkpoints along with open-source training, interpretation, and visualization tools.

Findings

01. SAEs generalize to longer contexts and fine-tuned models

02. Feature splitting reveals new features in learned representations

03. Publicly available SAE checkpoints facilitate interpretability research

Abstract

Sparse Autoencoders (SAEs) have emerged as a powerful unsupervised method for extracting sparse representations from language models, yet scalable training remains a significant challenge. We introduce a suite of 256 SAEs, trained on each layer and sublayer of the Llama-3.1-8B-Base model, with 32K and 128K features. Modifications to a state-of-the-art SAE variant, Top-K SAEs, are evaluated across multiple dimensions. In particular, we assess the generalizability of SAEs trained on base models to longer contexts and fine-tuned models. Additionally, we analyze the geometry of learned SAE latents, confirming that feature splitting enables the discovery of new features. The Llama Scope SAE checkpoints are publicly available at https://huggingface.co/fnlp/Llama-Scope, alongside our scalable training, interpretation, and visualization tools at…
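
For readers unfamiliar with the Top-K SAE variant mentioned in the abstract, the following is a minimal sketch of the generic Top-K formulation: encode the activation, keep only the k largest latent pre-activations, and decode from that sparse code. It is not the modified variant the paper evaluates, and it does not use the released checkpoints or the authors' tooling. The class name `TopKSAE`, the helper names, the random initialization, and the toy dimensions are all assumptions made for illustration.

```python
import random


def top_k(values, k):
    """Keep the k largest entries of `values` and zero out the rest."""
    if k >= len(values):
        return list(values)
    # Indices of the k largest pre-activations (sorted() is stable, so ties
    # are resolved by position and exactly k indices are kept).
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(values)]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TopKSAE:
    """Hypothetical toy Top-K sparse autoencoder (illustration only)."""

    def __init__(self, d_model, n_features, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Encoder: n_features rows, each of length d_model.
        self.W_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        # Decoder initialized as the transpose of the encoder (an assumption).
        self.W_dec = [[self.W_enc[f][d] for f in range(n_features)]
                      for d in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # Subtract the decoder bias, project to latent space, keep top k.
        centered = [xi - bi for xi, bi in zip(x, self.b_dec)]
        pre = [p + b for p, b in zip(matvec(self.W_enc, centered), self.b_enc)]
        return top_k(pre, self.k)

    def decode(self, z):
        # Reconstruct the activation from the sparse latent code.
        return [r + b for r, b in zip(matvec(self.W_dec, z), self.b_dec)]


if __name__ == "__main__":
    sae = TopKSAE(d_model=8, n_features=32, k=4)
    rng = random.Random(1)
    x = [rng.gauss(0.0, 1.0) for _ in range(8)]
    z = sae.encode(x)
    x_hat = sae.decode(z)
    print("active latents:", sum(1 for v in z if v != 0.0))
    print("reconstruction length:", len(x_hat))
```

The dimensions here are deliberately tiny. The SAEs described in the abstract have 32K or 128K features, and training them (the reconstruction loss and optimizer) is outside the scope of this sketch.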

Code & Models

Repositories

openmoss/language-model-saes
jax · Official

Models

🤗 OpenMOSS-Team/Llama-Scope
model · ♡ 25

Taxonomy

Topics: Video Analysis and Summarization · Image Processing and 3D Reconstruction · Generative Adversarial Networks and Image Synthesis

Methods: LLaMA · Balanced Selection · Sparse Autoencoder