Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning
Jeffrey Olmo, Jared Wilson, Max Forsey, Bryce Hepner, Thomas Vin Howe, David Wingate

TL;DR
This paper introduces Gradient SAEs, a novel autoencoder that uses gradient information to select features, resulting in more faithful reconstructions and more effective latent features for downstream tasks.
Contribution
Gradient SAEs incorporate gradient information into feature selection, improving the faithfulness of reconstructions and the effectiveness of learned features for downstream model steering.
Findings
For a given sparsity level, g-SAEs produce reconstructions that are more faithful to the original network's performance.
g-SAEs learn latents that are more effective at steering models.
Gradient-based feature selection enhances autoencoder performance.
Abstract
Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network's internal activations. However, SAEs are traditionally trained considering only activation values and not the effect those activations have on downstream computations. This limits the information available to learn features, and biases the autoencoder towards neglecting features that are represented with small activation values but strongly influence model outputs. To address this, we introduce Gradient SAEs (g-SAEs), which modify the k-sparse autoencoder architecture by augmenting the TopK activation function to rely on the gradients of the input activation when selecting the k elements. For a given sparsity level, g-SAEs produce reconstructions that are more faithful to original network performance when propagated…
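The abstract describes the mechanism only at a high level, so the following is a minimal sketch of what a gradient-aware TopK selection could look like, not the authors' implementation. The scoring rule (activation times the magnitude of the gradient projected onto each decoder direction) is an assumption made for illustration, and the names topk_mask, gradient_topk_encode, W_enc, b_enc, and W_dec are hypothetical.

```python
import numpy as np

def topk_mask(scores, k):
    """Boolean mask marking the k largest entries in each row of `scores`."""
    idx = np.argpartition(scores, -k, axis=-1)[..., -k:]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def gradient_topk_encode(x, grad, W_enc, b_enc, W_dec, k):
    """Hypothetical gradient-aware TopK encoder (illustrative assumption).

    x     : (batch, d) activations taken from the network
    grad  : (batch, d) gradient of a downstream loss with respect to x
    W_enc : (d, m) encoder weights, b_enc : (m,) encoder bias
    W_dec : (m, d) decoder weights, one row per latent direction
    """
    acts = np.maximum(x @ W_enc + b_enc, 0.0)   # ordinary SAE latents
    # Assumed sensitivity term: how strongly the downstream loss reacts
    # to movement along each latent's decoder direction.
    sensitivity = np.abs(grad @ W_dec.T)        # (batch, m)
    scores = acts * sensitivity                 # selection criterion only
    mask = topk_mask(scores, k)
    # The kept latents retain their original activation values.
    return np.where(mask, acts, 0.0)

# Toy example: latent 2 has a small activation but a large gradient.
x = np.array([[3.0, 2.0, 0.1]])
grad = np.array([[0.0, 1.0, 100.0]])
identity = np.eye(3)

z_grad = gradient_topk_encode(x, grad, identity, np.zeros(3), identity, k=1)
z_plain = np.where(topk_mask(np.maximum(x, 0.0), 1), x, 0.0)
print("gradient-aware selection keeps latent", int(np.argmax(z_grad)))
print("activation-only selection keeps latent", int(np.argmax(z_plain)))
```

In this toy case the activation-only TopK keeps the largest activation, while the gradient-aware score keeps the small activation that the downstream loss is most sensitive to, which is the bias the abstract says g-SAEs are meant to correct.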
Taxonomy
Topics: Lexicography and Language Studies · Natural Language Processing Techniques · Second Language Acquisition and Learning
