Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Kola Ayonrinde; Michael T. Pearce; Lee Sharkey

arXiv:2410.11179 · cs.LG · October 16, 2024

TL;DR

This paper introduces an information-theoretic framework using MDL principles to improve the interpretability of Sparse Autoencoders by promoting concise, accurate, and independent features for explaining neural activations.

Contribution

It proposes a novel MDL-based approach for interpreting SAEs, emphasizing feature independence and hierarchical structures to enhance explanation quality.

Findings

1. SAEs trained on MNIST under the MDL framework learn features that represent significant line segments of the digits.

2. MDL-based explanations avoid problems such as feature splitting that arise when optimising for sparsity alone.

3. Hierarchical SAE architectures emerge naturally from the MDL framework.

Abstract

Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting the internal representations of neural networks. However, naively optimising SAEs for reconstruction loss and sparsity results in a preference for SAEs that are extremely wide and sparse. We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms for communicating explanations of neural activations. We appeal to the Minimal Description Length (MDL) principle to motivate explanations of activations which are both accurate and concise. We further argue that interpretable SAEs require an additional property, "independent additivity": features should be able to be understood separately. We demonstrate an example of applying our MDL-inspired framework by training SAEs on MNIST handwritten digits and find that SAE features representing significant line segments are optimal, as…
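To make the compression view concrete, here is a minimal sketch of how one might count the bits needed to communicate a single sparse code. It assumes a simple index-plus-value encoding in which each active feature costs log2(N) bits to name and a fixed number of bits to state its quantised activation. This encoding, the function name `description_length_bits`, and the example numbers are illustrative assumptions, not the paper's exact formulation.

```python
import math


def description_length_bits(num_features: int, l0: float, bits_per_activation: float) -> float:
    """Approximate bits to transmit one sparse code.

    Assumed scheme (hypothetical, for illustration only): each of the
    `l0` active features is sent as an index into a dictionary of
    `num_features` entries plus a quantised activation value.
    """
    if num_features < 1 or l0 < 0 or bits_per_activation < 0:
        raise ValueError("arguments must be non-negative, with num_features >= 1")
    index_bits = math.log2(num_features)  # cost of naming one feature
    return l0 * (index_bits + bits_per_activation)


# A narrower, denser SAE versus a much wider, sparser one (made-up sizes).
narrow = description_length_bits(num_features=1024, l0=32, bits_per_activation=8)
wide = description_length_bits(num_features=1_048_576, l0=24, bits_per_activation=8)

print(f"narrow SAE: {narrow:.0f} bits per activation vector")
print(f"wide SAE:   {wide:.0f} bits per activation vector")
```

Under these assumed numbers the wider dictionary has fewer active features per input yet a longer description, because each feature index costs more bits to send. This is the kind of trade-off that sparsity alone does not capture and that a description-length criterion is meant to expose.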

Taxonomy

Topics: Adversarial Robustness in Machine Learning

Methods: Minimum Description Length