Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Jeffrey Olmo; Jared Wilson; Max Forsey; Bryce Hepner; Thomas Vin Howe; David Wingate

arXiv:2411.10397·cs.LG·April 2, 2025

TL;DR

This paper introduces Gradient SAEs (g-SAEs), a variant of the $k$-sparse autoencoder that uses gradient information when selecting which features to keep, yielding more faithful reconstructions and latents that are more effective for steering models.

Contribution

Gradient SAEs incorporate gradient information into feature selection, improving the faithfulness of reconstructions and the effectiveness of learned features for downstream model steering.

Findings

01 For a given sparsity level, g-SAEs produce reconstructions that are more faithful to original network performance.

02 g-SAEs learn latents that are better at steering models.

03 Gradient-based feature selection enhances autoencoder performance.

Abstract

Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering only activation values and not the effect those activations have on downstream computations. This limits the information available for learning features and biases the autoencoder towards neglecting features that are represented with small activation values but strongly influence model outputs. To address this, we introduce Gradient SAEs (g-SAEs), which modify the $k$-sparse autoencoder architecture by augmenting the TopK activation function to rely on the gradients of the input activation when selecting the $k$ elements. For a given sparsity level, g-SAEs produce reconstructions that are more faithful to original network performance when propagated…
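To make the difference between activation-only and gradient-aware selection concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the scoring rule, so the score used here (the magnitude of a latent's pre-activation times the dot product of its decoder direction with the gradient) is a hypothetical first-order estimate of downstream effect, and the names `topk_latents`, `gradient_topk_latents`, `decoder`, and `grad` are invented for this example.

```python
import numpy as np


def topk_mask(scores, k):
    """Boolean mask that is True at the k largest entries of a 1-D score vector."""
    idx = np.argpartition(scores, -k)[-k:]
    mask = np.zeros(scores.shape, dtype=bool)
    mask[idx] = True
    return mask


def topk_latents(pre_acts, k):
    """Plain TopK: keep the k largest pre-activations and zero the rest."""
    return np.where(topk_mask(pre_acts, k), pre_acts, 0.0)


def gradient_topk_latents(pre_acts, decoder, grad, k):
    """Hypothetical gradient-aware TopK (assumed scoring rule, see lead-in).

    pre_acts: (n_latents,) encoder pre-activations
    decoder:  (d_model, n_latents) decoder matrix, one column per latent
    grad:     (d_model,) gradient of a downstream loss w.r.t. the input activation
    """
    # Assumed score: first-order estimate of how much each latent's
    # contribution to the reconstruction would change the downstream loss.
    scores = np.abs(pre_acts * (decoder.T @ grad))
    return np.where(topk_mask(scores, k), pre_acts, 0.0)


# Toy example: latent 1 has a small activation but lies along a
# direction to which the downstream loss is very sensitive.
pre_acts = np.array([3.0, 0.1, 2.0, 0.5])
decoder = np.eye(4)
grad = np.array([0.0, 10.0, 1.0, 0.0])

plain = topk_latents(pre_acts, k=2)
grad_aware = gradient_topk_latents(pre_acts, decoder, grad, k=2)
```

In this toy setup, plain TopK keeps latents 0 and 2 (the largest activations), while the gradient-aware variant keeps latents 1 and 2, retaining the small-activation latent that the abstract describes as easy to neglect.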

Taxonomy

Topics: Lexicography and Language Studies · Natural Language Processing Techniques · Second Language Acquisition and Learning