Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control

Aleksandar Makelov; George Lange; Neel Nanda

arXiv:2405.08366·cs.LG·May 21, 2024·3 cites

Open Access

TL;DR

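This paper introduces a framework for evaluating sparse autoencoders (SAEs) on a specific task by comparing their unsupervised feature dictionaries against supervised ones. It applies the framework to the indirect object identification (IOI) task in GPT-2 Small and reports two SAE phenomena, feature occlusion and feature over-splitting.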

Contribution

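The paper proposes an evaluation framework for sparse autoencoders that uses supervised feature dictionaries as reference points along three axes (approximation, control, and interpretability), so that unsupervised dictionaries can be assessed against a task-specific baseline.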

Findings

01. SAEs capture interpretable features for the IOI task.

02. SAEs are less effective than supervised features at controlling the model.

03. Two phenomena are identified: feature occlusion and feature over-splitting.

Abstract

Disentangling model activations into meaningful features is a central problem in interpretability. However, the absence of ground-truth for these features in realistic scenarios makes validating recent approaches, such as sparse dictionary learning, elusive. To address this challenge, we propose a framework for evaluating feature dictionaries in the context of specific tasks, by comparing them against supervised feature dictionaries. First, we demonstrate that supervised dictionaries achieve excellent approximation, control, and interpretability of model computations on the task. Second, we use the supervised dictionaries to develop and contextualize evaluations of unsupervised dictionaries along the same three axes. We apply this framework to the indirect object identification (IOI) task using GPT-2 Small, with sparse autoencoders (SAEs) trained on either the IOI or…
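
To make the idea of a supervised feature dictionary concrete, here is a minimal sketch on synthetic data. It is an illustration under stated assumptions, not the paper's implementation: it assumes each example carries a few discrete, labelled task attributes (hypothetical here), builds one feature vector per attribute value as the mean of the centered activations sharing that value, and scores the "approximation" axis as the fraction of variance left unexplained. The names `fit_supervised_dictionary` and `reconstruct` are made up for this example.

```python
import itertools
import numpy as np

def fit_supervised_dictionary(acts, attrs):
    """Assumed construction: one feature per (attribute, value) pair,
    equal to the mean centered activation over examples with that value."""
    mean = acts.mean(axis=0)
    centered = acts - mean
    features = {}
    for j in range(attrs.shape[1]):
        for v in np.unique(attrs[:, j]):
            features[(j, int(v))] = centered[attrs[:, j] == v].mean(axis=0)
    return mean, features

def reconstruct(mean, features, attrs):
    """Approximate each activation as the mean plus one feature per attribute."""
    recon = np.tile(mean, (attrs.shape[0], 1))
    for i, row in enumerate(attrs):
        for j, v in enumerate(row):
            recon[i] += features[(j, int(v))]
    return recon

# Synthetic stand-in for model activations (not real GPT-2 data).
rng = np.random.default_rng(0)
d = 16
n_values = (3, 4)  # two hypothetical attributes with 3 and 4 possible values
true_vecs = [rng.normal(size=(n, d)) for n in n_values]

# Balanced design: every attribute combination appears equally often.
combos = np.array(list(itertools.product(*[range(n) for n in n_values])))
attrs = np.tile(combos, (20, 1))
acts = sum(true_vecs[j][attrs[:, j]] for j in range(len(n_values)))
acts = acts + 0.01 * rng.normal(size=acts.shape)

mean, features = fit_supervised_dictionary(acts, attrs)
recon = reconstruct(mean, features, attrs)

# Fraction of variance unexplained: lower means a better approximation.
fvu = ((acts - recon) ** 2).sum() / ((acts - acts.mean(axis=0)) ** 2).sum()
print(f"features: {len(features)}, fraction of variance unexplained: {fvu:.6f}")
```

The same score could in principle be computed for an unsupervised dictionary's reconstruction of `acts`, which gives the kind of side-by-side comparison the abstract describes. The balanced, additive synthetic data is chosen so that the per-value means recover the signal; real activations need not behave this way.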

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Linear Layer · Discriminative Fine-Tuning · Multi-Head Attention · Dense Connections · Attention Dropout · Weight Decay · Cosine Annealing · Dropout · Linear Warmup With Cosine Annealing