Interpreting CLIP with Hierarchical Sparse Autoencoders

Vladimir Zaigrajew; Hubert Baniecki; Przemyslaw Biecek

arXiv:2502.20578 · cs.CV · May 29, 2025

TL;DR

This paper introduces Matryoshka SAE (MSAE), a hierarchical sparse autoencoder that improves the interpretability of CLIP by balancing reconstruction quality and sparsity, enabling the extraction of semantic concepts for analysis and control.

Contribution

The paper proposes MSAE, a hierarchical autoencoder that optimizes both sparsity and reconstruction quality simultaneously, advancing interpretability of large-scale vision-language models.

Findings

01. MSAE achieves a state-of-the-art Pareto frontier between sparsity and reconstruction quality.

02. MSAE extracts over 120 semantic concepts from CLIP representations.

03. MSAE enables concept-based similarity search and bias analysis.

Abstract

Sparse autoencoders (SAEs) are useful for detecting and steering interpretable features in neural networks, with particular potential for understanding complex multimodal representations. Given their ability to uncover interpretable features, SAEs are particularly valuable for analyzing large-scale vision-language models (e.g., CLIP and SigLIP), which are fundamental building blocks in modern systems yet remain challenging to interpret and control. However, current SAE methods struggle to optimize reconstruction quality and sparsity simultaneously, because they rely on either activation suppression or rigid sparsity constraints. To this end, we introduce Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously, enabling a direct optimization of both metrics without compromise. MSAE establishes a new state-of-the-art…
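The idea of learning "at multiple granularities simultaneously" can be illustrated with a small sketch. This is not the authors' implementation: it assumes each granularity is a TopK sparsity level applied to one shared latent code, and that the objective is a weighted sum of the per-level reconstruction errors. The function names, the tied decoder, the absence of bias terms, and the values in `ks` are all hypothetical choices made for illustration.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, x):
    # W is a list of rows; returns W @ x
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def top_k(z, k):
    """Keep the k largest activations and zero out the rest."""
    order = sorted(range(len(z)), key=lambda i: z[i], reverse=True)
    keep = set(order[:k])
    return [z[i] if i in keep else 0.0 for i in range(len(z))]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def multi_granularity_loss(x, W_enc, W_dec, ks, weights):
    """Assumed objective: encode once, then reconstruct at every
    sparsity level in ks and sum the weighted errors."""
    z = relu(matvec(W_enc, x))              # shared latent code
    losses = []
    for k in ks:
        x_hat = matvec(W_dec, top_k(z, k))  # reconstruction from k latents
        losses.append(mse(x, x_hat))
    total = sum(w * l for w, l in zip(weights, losses))
    return total, losses

# Toy setup: 4-dimensional "embedding", 8 latent units, random weights.
random.seed(0)
d, n = 4, 8
W_enc = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]  # n x d
W_dec = [[W_enc[j][i] for j in range(n)] for i in range(d)]         # d x n, tied (assumption)
x = [random.gauss(0, 1) for _ in range(d)]
ks = [1, 2, 4, 8]                                                   # hypothetical levels
total, losses = multi_granularity_loss(x, W_enc, W_dec, ks, [1.0] * len(ks))
```

Because every level contributes to the same loss, the most active latents are pushed to carry a usable coarse reconstruction on their own, while the remaining latents refine it. In this reading, no single fixed sparsity level has to be chosen in advance.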

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training