Interpreting CLIP with Hierarchical Sparse Autoencoders
Vladimir Zaigrajew, Hubert Baniecki, Przemyslaw Biecek

TL;DR
This paper introduces Matryoshka SAE, a hierarchical autoencoder that improves interpretability of CLIP by balancing reconstruction quality and sparsity, enabling extraction of semantic concepts for analysis and control.
Contribution
The paper proposes MSAE, a hierarchical autoencoder that optimizes both sparsity and reconstruction quality simultaneously, advancing interpretability of large-scale vision-language models.
Findings
MSAE achieves a state-of-the-art Pareto frontier between reconstruction quality and sparsity.
MSAE extracts over 120 semantic concepts from CLIP representations.
MSAE enables concept-based similarity search and bias analysis.
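A minimal sketch of what concept-based similarity search could look like, assuming each image has already been encoded into a vector of SAE concept activations. The function name `concept_similarity_search`, the cosine-similarity ranking, and the toy activations are illustrative assumptions, not the authors' procedure.

```python
import numpy as np

def concept_similarity_search(query_acts, gallery_acts, top_n=3):
    """Rank gallery items by cosine similarity to the query in concept space.

    query_acts:   (n_concepts,) hypothetical SAE activations for the query.
    gallery_acts: (n_items, n_concepts) hypothetical activations for the gallery.
    """
    eps = 1e-12  # avoids division by zero for all-zero activation vectors
    q = query_acts / (np.linalg.norm(query_acts) + eps)
    g = gallery_acts / (np.linalg.norm(gallery_acts, axis=1, keepdims=True) + eps)
    scores = g @ q                       # cosine similarity per gallery item
    order = np.argsort(-scores)[:top_n]  # highest similarity first
    return order, scores[order]

# Toy example: three gallery items described by three concepts.
gallery = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [1.0, 1.0, 0.0]])
query = np.array([1.0, 0.0, 0.0])
order, scores = concept_similarity_search(query, gallery)
print(order, scores)
```

Because the comparison happens over sparse concept activations rather than raw embeddings, the concepts that drive a match can be read off directly from the nonzero entries shared by the query and the retrieved item.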
Abstract
Sparse autoencoders (SAEs) are useful for detecting and steering interpretable features in neural networks, with particular potential for understanding complex multimodal representations. Given their ability to uncover interpretable features, SAEs are particularly valuable for analyzing large-scale vision-language models (e.g., CLIP and SigLIP), which are fundamental building blocks in modern systems yet remain challenging to interpret and control. However, current SAE methods are limited in their ability to optimize reconstruction quality and sparsity simultaneously, because they rely on either activation suppression or rigid sparsity constraints. To this end, we introduce Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously, enabling a direct optimization of both metrics without compromise. MSAE establishes a new state-of-the-art…
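The idea of learning representations at multiple granularities can be illustrated with a short sketch. It assumes one plausible reading of the abstract: the same encoder output is sparsified at several sparsity levels, and the reconstruction errors at all levels are averaged. The function names, the ReLU encoder, the TopK sparsification, and the uniform averaging are assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def topk_mask(z, k):
    """Keep the k largest entries in each row of z and zero out the rest."""
    if k <= 0:
        return np.zeros_like(z)
    if k >= z.shape[1]:
        return z.copy()
    # Column indices of the k largest activations in every row.
    idx = np.argpartition(z, -k, axis=1)[:, -k:]
    out = np.zeros_like(z)
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=1), axis=1)
    return out

def multi_granularity_loss(x, W_enc, b_enc, W_dec, b_dec, ks):
    """Average the reconstruction error over several sparsity levels (assumed scheme)."""
    z = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)  # ReLU encoder activations
    losses = []
    for k in ks:
        x_hat = topk_mask(z, k) @ W_dec + b_dec       # decode from k active units
        losses.append(float(np.mean((x - x_hat) ** 2)))
    return float(np.mean(losses)), losses

# Toy run with random weights; the dimensions are arbitrary.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
W_enc = rng.normal(size=(16, 64)) * 0.1
W_dec = rng.normal(size=(64, 16)) * 0.1
total, per_level = multi_granularity_loss(
    x, W_enc, np.zeros(64), W_dec, np.zeros(16), ks=[4, 16, 64]
)
print(total, per_level)
```

Under this reading, small values of `k` push the most important features toward the top of the activation ranking, while large values of `k` preserve reconstruction fidelity, so a single model is trained against both objectives at once.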
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
