Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Dan Braun; Jordan Taylor; Nicholas Goldowsky-Dill; Lee Sharkey

arXiv:2405.12241·cs.LG·May 27, 2024

TL;DR

This paper introduces end-to-end sparse dictionary learning, a method that trains sparse autoencoders to identify functionally important features in neural networks by directly aligning their output distributions with the original model.

Contribution

The paper proposes a novel end-to-end training approach for sparse autoencoders that ensures learned features are functionally relevant to the network's behavior.

Findings

01. E2E SAEs explain more network performance with fewer total features.

02. E2E SAEs require fewer simultaneously active features per data point.

03. E2E SAEs achieve these gains at no cost to interpretability.
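The "active features per data point" quantity in the findings is commonly measured as the mean L0 norm of the SAE feature activations. The snippet below is a minimal sketch of that measurement, not the authors' evaluation code; the function name `mean_l0`, the threshold `eps`, and the toy array are hypothetical and chosen only for illustration.

```python
import numpy as np


def mean_l0(features, eps=0.0):
    """Average number of active (non-zero) features per data point.

    features: array of shape (n_points, n_features) holding SAE feature
    activations. `eps` is an assumed activity threshold, not a value
    taken from the paper.
    """
    active = np.abs(features) > eps          # boolean mask of active features
    return float(active.sum(axis=-1).mean())  # count per row, then average


# Toy activations: the first point has 2 active features, the second has 1.
feature_acts = np.array([
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 2.0],
])
print(mean_l0(feature_acts))  # 1.5
```

A lower value means each input is explained by fewer dictionary features at once.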

Abstract

Identifying the features learned by neural networks is a core challenge in mechanistic interpretability. Sparse autoencoders (SAEs), which learn a sparse, overcomplete dictionary that reconstructs a network's internal activations, have been used to identify these features. However, SAEs may learn more about the structure of the dataset than the computational structure of the network. There is therefore only indirect reason to believe that the directions found in these dictionaries are functionally important to the network. We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted. Compared to standard SAEs, e2e SAEs offer a Pareto improvement: They explain more network…
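The objective described in the abstract can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not the implementation in the linked repository: the single linear `downstream` map stands in for the rest of the network, the ReLU encoder and linear decoder are a common SAE form assumed here, and the L1 sparsity term and its coefficient are assumptions (the excerpt above only states the KL term). All function and variable names are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def sae_forward(acts, W_enc, b_enc, W_dec, b_dec):
    # Assumed SAE form: ReLU encoder into an overcomplete dictionary,
    # linear decoder back to the activation space.
    features = np.maximum(acts @ W_enc + b_enc, 0.0)
    recon = features @ W_dec + b_dec
    return features, recon


def kl_divergence(logits_orig, logits_sae):
    # KL(p_orig || p_sae), averaged over data points.
    logp = log_softmax(logits_orig)
    logq = log_softmax(logits_sae)
    return float((np.exp(logp) * (logp - logq)).sum(axis=-1).mean())


def e2e_loss(acts, downstream, sae_params, sparsity_coeff):
    # Compare the model's outputs with and without SAE activations inserted.
    features, recon = sae_forward(acts, *sae_params)
    kl = kl_divergence(downstream(acts), downstream(recon))
    # Assumption: L1 penalty on feature activations to encourage sparsity.
    l1 = float(np.abs(features).sum(axis=-1).mean())
    return kl + sparsity_coeff * l1


# Toy setup: 4 data points, 8-dim activations, 32 dictionary features.
rng = np.random.default_rng(0)
d_model, n_features, vocab = 8, 32, 10
acts = rng.normal(size=(4, d_model))
W_out = rng.normal(size=(d_model, vocab))


def downstream(a):
    # Stand-in for the remaining layers of the original model.
    return a @ W_out


sae_params = (
    rng.normal(size=(d_model, n_features)) * 0.1,  # W_enc
    np.zeros(n_features),                          # b_enc
    rng.normal(size=(n_features, d_model)) * 0.1,  # W_dec
    np.zeros(d_model),                             # b_dec
)
loss = e2e_loss(acts, downstream, sae_params, sparsity_coeff=1e-3)
print(loss)
```

The key difference from a standard SAE objective is that the error is measured on the model's output distribution rather than on the reconstructed activations themselves, so features are rewarded only insofar as they preserve the network's behavior.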

Code & Models

Repositories

apolloresearch/e2e_sae
jax · Official

Videos

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning · slideslive

Taxonomy

Topics: Natural Language Processing Techniques · Text and Document Classification Technologies · Video Analysis and Summarization
