Residual Stream Analysis with Multi-Layer SAEs

Tim Lawson; Lucy Farnik; Conor Houghton; Laurence Aitchison

arXiv:2409.04185·cs.LG·February 25, 2025

TL;DR

This paper introduces multi-layer sparse autoencoders (MLSAEs) trained on residual streams from all transformer layers to better understand information flow and representation changes across layers in language models.

Contribution

The paper presents the MLSAE, a single sparse autoencoder trained on residual stream activations from every transformer layer, which makes it possible to study how a given latent's activation is distributed across layers.

Findings

01

For a given token, an individual latent is often active at a single layer, but the layer at which it is active can differ from token to token.

02

Variance in latent activation distributions over layers is much higher across tokens than within a single token.

03

Larger models show increased multi-layer latent activity, aligning with more similar residual streams across layers.
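The first two findings can be illustrated with a small worked example. This is only a sketch under stated assumptions: the abstract's definition of the "distribution over layers" is cut off on this page, so the normalization below (dividing a latent's activations by their sum over layers) is one plausible reading and not necessarily the paper's exact definition. The activation values and the helper names `layer_distribution` and `mean_and_variance` are hypothetical.

```python
def layer_distribution(acts):
    """Normalize one latent's non-negative activations over layers so they sum to 1.

    Assumption: this normalization is an illustrative choice, not the paper's
    confirmed definition. Returns None if the latent is inactive everywhere.
    """
    total = sum(acts)
    if total == 0:
        return None
    return [a / total for a in acts]


def mean_and_variance(probs):
    """Expected layer index and its variance under a distribution over layers."""
    mean = sum(layer * p for layer, p in enumerate(probs))
    var = sum(p * (layer - mean) ** 2 for layer, p in enumerate(probs))
    return mean, var


# Hypothetical activations of ONE latent: rows are tokens, columns are layers 0..3.
acts_per_token = [
    [0.0, 2.0, 0.0, 0.0],  # token A: active only at layer 1
    [0.0, 0.0, 0.0, 3.0],  # token B: active only at layer 3
]

# Per-token statistics: each token's distribution is concentrated on one layer.
per_token = [mean_and_variance(layer_distribution(a)) for a in acts_per_token]

# Aggregate statistics: sum activations over tokens first, then normalize.
aggregate = [sum(column) for column in zip(*acts_per_token)]
agg_mean, agg_var = mean_and_variance(layer_distribution(aggregate))

print(per_token)          # variance is zero for each single token
print(agg_mean, agg_var)  # variance is non-zero once tokens are pooled
```

In this toy case each token has zero variance over layers, while the pooled distribution has a clearly non-zero variance, which is the qualitative pattern the findings describe.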

Abstract

Sparse autoencoders (SAEs) are a promising approach to interpreting the internal representations of transformer language models. However, SAEs are usually trained separately on each transformer layer, making it difficult to use them to study how information flows across layers. To solve this problem, we introduce the multi-layer SAE (MLSAE): a single SAE trained on the residual stream activation vectors from every transformer layer. Given that the residual stream is understood to preserve information across layers, we expected MLSAE latents to 'switch on' at a token position and remain active at later layers. Interestingly, we find that individual latents are often active at a single layer for a given token or prompt, but the layer at which an individual latent is active may differ for different tokens or prompts. We quantify these phenomena by defining a distribution over layers and…
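The core idea in the abstract, one SAE shared across all layers, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the top-k sparsity rule, the random untrained weights, the toy dimensions, and the helper names `pool_layers`, `topk_encode`, and `decode` are all assumptions made for the example.

```python
import random


def pool_layers(resid):
    """Flatten resid[layer][token] into (layer, token, vector) examples.

    Every layer's residual stream vector becomes a training example for the
    SAME autoencoder, which is what makes the SAE "multi-layer".
    """
    return [(l, t, vec) for l, layer in enumerate(resid) for t, vec in enumerate(layer)]


def topk_encode(x, W_enc, b_enc, k):
    """Encode x, keeping only the k largest pre-activations (assumed sparsity rule)."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_enc, b_enc)]
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    return [max(pre[i], 0.0) if i in keep else 0.0 for i in range(len(pre))]


def decode(z, W_dec, b_dec):
    """Reconstruct the input vector as a sparse combination of decoder rows."""
    return [sum(z[j] * W_dec[j][i] for j in range(len(z))) + b_dec[i]
            for i in range(len(b_dec))]


# Toy sizes (hypothetical): 3 layers, 2 tokens, model width 4, 8 latents, k = 2.
random.seed(0)
n_layers, n_tokens, d, n_latents, k = 3, 2, 4, 8, 2
resid = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(n_tokens)]
         for _ in range(n_layers)]
W_enc = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_latents)]
W_dec = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_latents)]
b_enc, b_dec = [0.0] * n_latents, [0.0] * d

examples = pool_layers(resid)
latents = [(l, t, topk_encode(vec, W_enc, b_enc, k)) for l, t, vec in examples]
recon = decode(latents[0][2], W_dec, b_dec)
```

Because each latent vector keeps its layer and token indices, the same latent can be compared across layers for a fixed token, which is the kind of analysis the findings report.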

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

The work experiments with Pythia models (Transformers) of different sizes, making the work highly reproducible. The experiments are sound and most of the doubts I had while reading were answered already in the text. While the paper doesn't offer revolutionary insights on its own, it is an excellent improvement to sparse autoencoders and may be very useful for mechanistic interpretability. The model also tests the relative performance of MLSAE at different expansion factors and sparsity levels an

Weaknesses

1. The biggest weakness that I see is the lack of direct experimental comparison to previous techniques. I'd expect a paper to compare MLSAE and existing SAE techniques, similar to the comparison in Figure 5 of MLSAE w/ and w/o tuned lens. For example, I think it would be possible to show a version of Figure 1 (and almost all other figures, too) using previous SAE techniques, even without training on most of those layers. This would give the reader a qualitative understanding of whether MLSAE pr

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The paper is well written.
- The experimental evaluation is well done.
- The research area of the paper is mechanistic interpretability, which is potentially very useful for understanding complex models and important for AI safety.

Weaknesses

1. Some things might be unclear (on the first reading):
   - this layer may differ for different tokens or prompts -> which layer it is might differ ... ?
   - Under multiple figures: "expected value of the layer" -> "average layer index" or "expected value of the layer index"
   - Eq. (3): The meaning of $\operatorname{Var}$ could be clarified.
   - In some places, the notation is a bit inconsistent or non-standard.
   - In section 3.1, $\bf x\in\mathbb R^d$, but in Eq. (3), $\bf x$ represents a sequenc

Reviewer 03 · Rating 6 · Confidence 2

Strengths

1. This work proposes the MLSAE method to analyze the interpretability of information transmitted through residual flow in Transformers.
2. The author only uses one SAE in multi-layer Transformers.
3. The author analyzes the fraction and variance of latents of models of different sizes through detailed experiments.

Weaknesses

1. As indicated in both the Abstract and Conclusion sections, the author investigates the transformation of representations across various layers within Transformers. However, the analysis is limited to a statistical perspective, focusing solely on the fractions and variances of representations at different layers. The linkage between this statistical overview and a deeper understanding of the underlying mechanisms of Transformers is not explicitly articulated. It remains unclear to me how these

Code & Models

Repositories

tim-lawson/mlsae
jax · Official

Taxonomy

Topics: Fault Detection and Control Systems · Anomaly Detection Techniques and Applications

Methods: Sparse Autoencoder