Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders

Luke Marks; Alasdair Paren; David Krueger; Fazl Barez

arXiv:2411.01220·cs.LG·November 7, 2024·2 cites

TL;DR

This paper introduces Mutual Feature Regularization (MFR), a technique that encourages parallel sparse autoencoders to learn similar, input-relevant features, improving interpretability and reconstruction performance on synthetic, EEG, and GPT-2 data.

Contribution

The paper proposes MFR, a novel regularization method that aligns features learned by multiple SAEs, enhancing interpretability and reconstruction accuracy across different data types.

Findings

01. MFR improves SAE reconstruction loss by up to 21.21% on GPT-2 Small.

02. MFR helps SAEs learn input features more effectively on synthetic data.

03. Training SAEs with aligned features enhances the interpretability of neural network features.

Abstract

Sparse Autoencoders (SAEs) have shown promise in improving the interpretability of neural network activations, but can learn features that are not features of the input, limiting their effectiveness. We propose Mutual Feature Regularization (MFR), a regularization technique for improving feature learning by encouraging SAEs trained in parallel to learn similar features. We motivate MFR by showing that features learned by multiple SAEs are more likely to correlate with features of the input. By training on synthetic data with known features of the input, we show that MFR can help SAEs learn those features, as we can directly compare the features learned by the SAE with the input features for the synthetic data. We then scale MFR to SAEs that are trained to denoise electroencephalography (EEG) data and SAEs that are trained to reconstruct GPT-2…
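
The abstract does not spell out the exact form of the regularizer, so the following is only a minimal sketch of the general idea of rewarding two SAEs trained in parallel for learning similar features. It assumes each SAE's learned features are the rows of a decoder matrix and uses a mean-of-best-cosine-match score. The function names, the row layout, the one-directional matching, and the weight `alpha` are all hypothetical choices made for illustration and are not taken from the paper or its repository.

```python
import numpy as np


def mean_max_cosine_similarity(dec_a: np.ndarray, dec_b: np.ndarray) -> float:
    """Average, over features of SAE A, of the best |cosine| match in SAE B.

    Assumption: each row of a decoder matrix is one learned feature direction.
    """
    # Normalize rows so the dot product below is a cosine similarity.
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    # Absolute value treats a feature and its negation as the same direction.
    sims = np.abs(a @ b.T)  # shape: (n_features_a, n_features_b)
    return float(sims.max(axis=1).mean())


def mutual_feature_penalty(dec_a: np.ndarray, dec_b: np.ndarray) -> float:
    """Hypothetical auxiliary penalty: 0 when every feature of A has an exact match in B."""
    return 1.0 - mean_max_cosine_similarity(dec_a, dec_b)


def total_loss(recon_a: float, recon_b: float,
               dec_a: np.ndarray, dec_b: np.ndarray,
               alpha: float = 0.1) -> float:
    """Sum of both reconstruction losses plus the weighted similarity penalty.

    `alpha` is an illustrative weight, not a value from the paper.
    """
    return recon_a + recon_b + alpha * mutual_feature_penalty(dec_a, dec_b)
```

Under these assumptions, two SAEs that learn the same features in a different order incur no penalty, because the score matches each feature to its best counterpart and ignores ordering. A differentiable version for actual training would express the same computation in an autodiff framework such as PyTorch.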

Code & Models

Repositories

luke-marks0/mutual-feature-regularization
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Neural Networks and Applications · Explainable Artificial Intelligence (XAI)

Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Dense Connections · Layer Normalization · Residual Connection · Linear Warmup With Cosine Annealing · Adam · Attention Dropout