Attribution-Guided Distillation of Matryoshka Sparse Autoencoders
Cristina P. Martin-Linares, Jonathan P. Ling

TL;DR
This paper introduces Distilled Matryoshka Sparse Autoencoders (DMSAEs), a distillation method for sparse autoencoders that identifies a core set of consistently useful features and reuses it to train new SAEs, with the aim of making interpretations easier to transfer across training runs and sparsity levels.
Contribution
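The paper proposes an iterative distillation pipeline for Matryoshka sparse autoencoders: it ranks features by gradient × activation attribution to next-token loss, keeps the smallest subset that explains a fixed fraction of the attribution, and carries only that core's encoder weights into the next training cycle.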
Findings
Distilled core features are consistently selected across cycles.
Training with the distilled core improves SAEBench metrics.
The method enables transfer of features across different sparsity levels.
Abstract
Sparse autoencoders (SAEs) aim to disentangle model activations into monosemantic, human-interpretable features. In practice, learned features are often redundant and vary across training runs and sparsity levels, which makes interpretations difficult to transfer and reuse. We introduce Distilled Matryoshka Sparse Autoencoders (DMSAEs), a training pipeline that distills a compact core of consistently useful features and reuses it to train new SAEs. DMSAEs run an iterative distillation cycle: train a Matryoshka SAE with a shared core, use gradient × activation to measure each feature's contribution to next-token loss in the most nested reconstruction, and keep only the smallest subset that explains a fixed fraction of the attribution. Only the core encoder weight vectors are transferred across cycles; the core decoder and all non-core latents are reinitialized each time. On Gemma-2-2B…
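The selection and transfer steps described in the abstract can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `select_core` and `init_next_cycle`, the use of the mean absolute gradient × activation over tokens as the per-feature score, the 0.9 default threshold, the Gaussian initialization, and the convention that core features occupy the first rows of the encoder are all hypothetical choices made for the example.

```python
import numpy as np


def select_core(activations, gradients, fraction=0.9):
    """Return sorted indices of the smallest feature set whose summed
    attribution reaches `fraction` of the total.

    activations, gradients: arrays of shape (n_tokens, n_features), where
    gradients holds d(loss)/d(activation) for each feature at each token.
    Assumption: a feature's score is the mean |gradient * activation|.
    """
    attribution = np.abs(activations * gradients).mean(axis=0)
    total = attribution.sum()
    if total == 0:
        return np.array([], dtype=int)

    # Rank features from most to least attributed.
    order = np.argsort(-attribution, kind="stable")
    cumulative = np.cumsum(attribution[order]) / total

    # Taking the top-ranked prefix is the smallest set (by count) that
    # reaches the threshold.
    k = int(np.searchsorted(cumulative, fraction, side="left")) + 1
    k = min(k, len(order))
    return np.sort(order[:k])


def init_next_cycle(prev_encoder, core_idx, n_latents, rng):
    """Build fresh weights for the next cycle, copying only the core
    encoder rows from the previous SAE.

    prev_encoder: array of shape (n_prev_latents, d_model).
    Assumption: core features are placed in the first len(core_idx) rows,
    and everything else (including the whole decoder) is re-drawn.
    """
    n_core = len(core_idx)
    if n_core > n_latents:
        raise ValueError("core is larger than the new dictionary")
    d_model = prev_encoder.shape[1]
    scale = 1.0 / np.sqrt(d_model)

    encoder = rng.normal(scale=scale, size=(n_latents, d_model))
    decoder = rng.normal(scale=scale, size=(n_latents, d_model))
    encoder[:n_core] = prev_encoder[core_idx]  # transfer core encoder only
    return encoder, decoder
```

In a real pipeline the `activations` and `gradients` arrays would come from a backward pass of the next-token loss through the most nested reconstruction; here they are plain arrays so the selection logic can be checked in isolation.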
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
