Mechanistic Permutability: Match Features Across Layers

Nikita Balagansky; Ian Maksimov; Daniil Gavrilov

arXiv:2410.07656 · cs.LG · March 4, 2025

TL;DR

This paper introduces SAE Match, a data-free method for aligning Sparse Autoencoder (SAE) features across the layers of a neural network, which helps trace how features evolve and persist in deep models.

Contribution

SAE Match is a novel, data-free technique that aligns SAE features across layers by minimizing the mean squared error between the folded parameters of the SAEs, making feature dynamics across layers easier to interpret.

Findings

01. Effectively captures feature evolution across layers.

02. Features persist over multiple layers.

03. Can approximate hidden states across layers.

Abstract

Understanding how features evolve across layers in deep neural networks is a fundamental challenge in mechanistic interpretability, particularly due to polysemanticity and feature superposition. While Sparse Autoencoders (SAEs) have been used to extract interpretable features from individual layers, aligning these features across layers has remained an open problem. In this paper, we introduce SAE Match, a novel, data-free method for aligning SAE features across different layers of a neural network. Our approach involves matching features by minimizing the mean squared error between the folded parameters of SAEs, a technique that incorporates activation thresholds into the encoder and decoder weights to account for differences in feature scales. Through extensive experiments on the Gemma 2 language model, we demonstrate that our method effectively captures feature evolution across…
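The folding and matching steps described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption: a JumpReLU SAE of the form `JumpReLU(x W_enc + b_enc) W_dec` with positive per-feature thresholds `theta` and no decoder bias, folding that divides the encoder by `theta` and scales the decoder rows by `theta`, matching on decoder weights only, and SciPy's `linear_sum_assignment` as the assignment solver. The function names `fold_jumprelu`, `jumprelu_sae`, and `match_features` are hypothetical and are not taken from the paper's code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def fold_jumprelu(w_enc, b_enc, w_dec, theta):
    """Fold positive JumpReLU thresholds into the weights (threshold becomes 1)."""
    # Shapes: w_enc (d_model, n_features), b_enc (n_features,),
    #         w_dec (n_features, d_model), theta (n_features,), theta > 0.
    return w_enc / theta, b_enc / theta, w_dec * theta[:, None]


def jumprelu_sae(x, w_enc, b_enc, w_dec, theta):
    """Assumed SAE form (no decoder bias): JumpReLU(x W_enc + b_enc) W_dec."""
    pre = x @ w_enc + b_enc
    acts = pre * (pre > theta)  # keep a pre-activation only above its threshold
    return acts @ w_dec


def match_features(w_dec_a, w_dec_b):
    """Return perm such that row perm[i] of w_dec_b is matched to row i of w_dec_a."""
    # Pairwise squared Euclidean distances between feature rows of both layers.
    sq_a = (w_dec_a ** 2).sum(axis=1)[:, None]
    sq_b = (w_dec_b ** 2).sum(axis=1)[None, :]
    cost = sq_a + sq_b - 2.0 * w_dec_a @ w_dec_b.T
    # One-to-one assignment minimizing the summed squared error.
    _, cols = linear_sum_assignment(cost)
    return cols


# Toy data (random weights, not real SAE parameters).
rng = np.random.default_rng(0)
d_model, n_features = 8, 5
w_enc = rng.normal(size=(d_model, n_features))
b_enc = rng.normal(size=n_features)
w_dec = rng.normal(size=(n_features, d_model))
theta = rng.uniform(0.5, 2.0, size=n_features)

# Under the assumptions above, folding leaves the reconstruction unchanged.
x = rng.normal(size=(4, d_model))
folded = fold_jumprelu(w_enc, b_enc, w_dec, theta)
same_output = np.allclose(
    jumprelu_sae(x, w_enc, b_enc, w_dec, theta),
    jumprelu_sae(x, *folded, np.ones(n_features)),
)

# Toy "next layer": the same folded decoder, shuffled and slightly perturbed.
true_perm = rng.permutation(n_features)
w_dec_next = folded[2][true_perm] + 0.01 * rng.normal(size=(n_features, d_model))
perm = match_features(folded[2], w_dec_next)
```

In this toy setup `perm` recovers the inverse of the shuffle, so `w_dec_next[perm]` lines up row by row with the folded decoder of the first layer. Real SAEs have far more features, so the dense cost matrix here is only practical for illustration.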

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- The paper's SAE Match technique appears to work well at finding corresponding SAE features between layers, which can contribute to the goal of mapping a "feature circuit" (as in Marks et al. (2024) https://arxiv.org/pdf/2403.19647). This technique is data-free, meaning only the SAE weights are required, and not any model or SAE activations.
- The paper also provides useful empirical evidence that SAEs find features between layers that are simultaneously 1) close in the space of parameters, an

Weaknesses

- The authors get far worse results on layers 0-9 of the model than on layers 10-25, indicating that the technique may not fully generalize. The authors claim that this is to be expected, saying "This phenomenon aligns with findings from previous research. Gurnee et al. (2023) also reported increased polysemanticity in the early layers of neural networks." This explanation is unsatisfactory because Gurnee et al. (2023) were working with LLM neurons, not SAE features. Additionally, Cunningham et

Reviewer 02 · Rating 5 · Confidence 3

Strengths

Main finding: Cosine similarity alone is not a great proxy for late layers, as residual stream norms increase. The authors propose parameter folding, which effectively addresses this problem for JumpReLU SAEs. Current work relies on cosine similarity, and I am convinced the field should adopt this proposed technique.

Weaknesses

### Critiques that can be addressed in this paper

- I am unsure whether the original hypothesis of permutation is answered. The term "matching" implies a binary measure of whether a feature mapping is true or false. This might require the introduction of a cutoff threshold, or applying a clustering technique. Otherwise, the framing of similarity measures might be clearer than permutation. I'm curious about the authors' opinion on whether there is a binary criterion for whether features do/don't

Reviewer 03 · Rating 5 · Confidence 3

Strengths

- Proposes a novel and interesting strategy for pairing features between layers.
- Studies some of the shortcomings (e.g. long tail of pairing 'failures') of this strategy.
- The presentation is very clear and understandable.

Weaknesses

- It would be good to spend more time justifying the hypotheses of Section 3. I do not think that the results in Figure 3 constitute much evidence for Hypothesis 2, since the reasoning here seems slightly circular: you propose parameter folding based on the observation that $\theta$ tracks the activation norms, but then evaluate feature similarity using the same objective that you are explicitly trying to minimize. Therefore, it is trivially true that 'folding+matching' outperforms 'matching'

Videos

Mechanistic Permutability: Match Features Across Layers · slideslive

Taxonomy

Topics: Natural Language Processing Techniques · Handwritten Text Recognition Techniques