MetaSAEs: Joint Training with a Decomposability Penalty Produces More Atomic Sparse Autoencoder Latents
Matthew Levinson

TL;DR
This paper introduces a joint training method with a decomposability penalty that enhances the atomicity and interpretability of sparse autoencoder latents, validated on GPT-2 and Gemma 2 models.
Contribution
A novel joint training objective that reduces subspace blending in SAE latents by penalizing their reconstructability from a meta dictionary, improving interpretability.
Findings
Mean $|\phi|$ reduced by 7.5% on GPT-2 large.
Automated interpretability scores improved by 7.6%.
Method shows promising transferability to larger models.
Abstract
Sparse autoencoders (SAEs) are increasingly used for safety-relevant applications including alignment detection and model steering. These use cases require SAE latents to be as atomic as possible. Each latent should represent a single coherent concept drawn from a single underlying representational subspace. In practice, SAE latents blend representational subspaces together. A single feature can activate across semantically distinct contexts that share no true common representation, muddying an already complex picture of model computation. We introduce a joint training objective that directly penalizes this subspace blending. A small meta SAE is trained alongside the primary SAE to sparsely reconstruct the primary SAE's decoder columns; the primary SAE is penalized whenever its decoder directions are easy to reconstruct from the meta dictionary. This occurs whenever latent directions…
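The penalty described in the abstract can be illustrated with a minimal NumPy sketch. The abstract excerpt does not give the exact loss, so everything concrete here is an assumption made for illustration: the function names, the top-k ReLU stand-in for the meta SAE, the use of cosine similarity as the measure of "easy to reconstruct", and the weight `lam`. It is not the paper's implementation, and it only evaluates the penalty; joint training would also update both dictionaries by gradient descent.

```python
import numpy as np

def unit_columns(m):
    """Scale each column to unit L2 norm; all-zero columns stay zero."""
    norms = np.linalg.norm(m, axis=0, keepdims=True)
    return m / np.maximum(norms, 1e-12)

def meta_reconstruct(decoder, meta_dict, k):
    """Sparsely reconstruct each primary decoder column from meta atoms.

    Hypothetical stand-in for the meta SAE: score atoms by dot product,
    apply a ReLU, keep the k largest codes per column, decode linearly.
    """
    codes = np.maximum(meta_dict.T @ decoder, 0.0)    # (n_meta, n_latents)
    if k < codes.shape[0]:
        kth = np.partition(codes, -k, axis=0)[-k]     # k-th largest per column
        codes = np.where(codes >= kth, codes, 0.0)    # ties may keep extra atoms
    return meta_dict @ codes

def decomposability_penalty(decoder, meta_dict, k=2):
    """Mean cosine similarity between decoder columns and their sparse
    meta reconstructions (assumed form): high means easy to decompose."""
    w = unit_columns(decoder)
    recon = unit_columns(meta_reconstruct(w, unit_columns(meta_dict), k))
    return float(np.mean(np.sum(w * recon, axis=0)))

def joint_loss(recon_mse, decoder, meta_dict, lam=0.1, k=2):
    """Primary SAE objective plus the weighted penalty (lam is assumed)."""
    return recon_mse + lam * decomposability_penalty(decoder, meta_dict, k)

# Toy example in 3 dimensions with two meta atoms, e1 and e2.
meta = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 0.0]])
blended = np.array([[1.0], [1.0], [0.0]])   # mixes both meta atoms
isolated = np.array([[0.0], [0.0], [1.0]])  # orthogonal to the meta dictionary

print(decomposability_penalty(blended, meta))
print(decomposability_penalty(isolated, meta))
```

In this toy setup the blended direction is reconstructed almost perfectly from two meta atoms and so receives a penalty near 1, while the direction orthogonal to the meta dictionary receives a penalty of 0. That matches the qualitative behavior the abstract describes: latents that are easy to rebuild from the meta dictionary are the ones pushed away during training.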
