Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs
Kola Ayonrinde, Michael T. Pearce, Lee Sharkey

TL;DR
This paper introduces an information-theoretic framework, based on the Minimum Description Length (MDL) principle, that treats Sparse Autoencoders (SAEs) as lossy compression algorithms and favors explanations of neural activations that are accurate, concise, and built from features that can be understood separately.
Contribution
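It proposes evaluating SAEs by the description length of their explanations rather than by sparsity alone, adds the requirement of "independent additivity" (each feature should be understandable on its own), and notes that the framework suggests hierarchical SAE architectures.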
Findings
In the MNIST demonstration, the SAEs favored by the MDL-inspired criterion have features representing significant line segments.
Using MDL rather than sparsity alone may avoid pitfalls such as undesirable feature splitting.
The MDL framework naturally suggests hierarchical SAE architectures.
Abstract
Sparse Autoencoders (SAEs) have emerged as a useful tool for interpreting the internal representations of neural networks. However, naively optimising SAEs for reconstruction loss and sparsity results in a preference for SAEs that are extremely wide and sparse. We present an information-theoretic framework for interpreting SAEs as lossy compression algorithms for communicating explanations of neural activations. We appeal to the Minimal Description Length (MDL) principle to motivate explanations of activations which are both accurate and concise. We further argue that interpretable SAEs require an additional property, "independent additivity": features should be able to be understood separately. We demonstrate an example of applying our MDL-inspired framework by training SAEs on MNIST handwritten digits and find that SAE features representing significant line segments are optimal, as…
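The compression view in the abstract can be made concrete by counting the bits needed to transmit one SAE explanation. The following is a minimal sketch under a simple coding scheme that is assumed here for illustration and is not taken from the paper: for each active feature, the sender transmits the feature's index (log2 of the dictionary size, in bits) plus a quantised activation value. The function name, its parameters, and the example configurations are all hypothetical.

```python
import math


def description_length_bits(dict_size: int, l0: float, bits_per_activation: float) -> float:
    """Approximate bits needed to send one sparse code.

    Assumed coding scheme (illustrative, not the paper's exact definition):
    each of the `l0` active features costs log2(dict_size) bits to name
    plus `bits_per_activation` bits for its quantised value.
    """
    if dict_size < 1:
        raise ValueError("dict_size must be at least 1")
    if l0 < 0 or bits_per_activation < 0:
        raise ValueError("l0 and bits_per_activation must be non-negative")
    return l0 * (math.log2(dict_size) + bits_per_activation)


# Hypothetical configurations, chosen only to show the arithmetic.
narrow = description_length_bits(dict_size=16, l0=4, bits_per_activation=8)
wide = description_length_bits(dict_size=1024, l0=10, bits_per_activation=8)
print(f"narrow dictionary: {narrow:.0f} bits, wide dictionary: {wide:.0f} bits")
```

Under this scheme the cost is a product of the number of active features and the per-feature cost, so widening the dictionary raises the price of naming each feature even as it may reduce how many features are active. Comparing SAEs this way is only meaningful at matched reconstruction accuracy, which this sketch does not model.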
Taxonomy
Topics: Adversarial Robustness in Machine Learning
Methods: Minimum Description Length
