Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning
Chuan Qin, Constantin Venhoff, Sonia Joseph, Fanyi Xiao, Stefan Scherer

TL;DR
This paper introduces Sparse CLIP, a training method that creates sparse, interpretable, and high-performing vision-language representations, challenging the idea that interpretability must compromise accuracy.
Contribution
The authors propose integrating sparsity directly into CLIP training to enhance interpretability without sacrificing downstream performance or multimodal capabilities.
Findings
Sparse CLIP maintains strong downstream task performance.
Sparse representations improve interpretability and concept alignment.
Multimodal capabilities are preserved in sparse representations.
Abstract
Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in vision-language representation learning, powering diverse downstream tasks and serving as the default vision backbone in multimodal large language models (MLLMs). Despite its success, CLIP's dense and opaque latent representations pose significant interpretability challenges. A common assumption is that interpretability and performance are in tension: enforcing sparsity during training degrades accuracy, motivating recent post-hoc approaches such as Sparse Autoencoders (SAEs). However, these post-hoc approaches often suffer from degraded downstream performance and loss of CLIP's inherent multimodal capabilities, with most learned features remaining unimodal. We propose a simple yet effective approach that integrates sparsity directly into CLIP training, yielding representations that are both interpretable and…
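To make "integrating sparsity directly into CLIP training" concrete, here is a minimal sketch of one way it could look. It is not the authors' confirmed recipe. It assumes sparsity comes from a ReLU applied to each embedding before L2 normalization (one of the reviews below mentions ReLU), followed by the standard symmetric contrastive loss. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math

def relu(v):
    # Zero out negative coordinates; this is the assumed source of sparsity.
    return [max(0.0, x) for x in v]

def l2_normalize(v, eps=1e-12):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def sparse_embed(v):
    # Assumption: ReLU first, then unit-normalize, so dot products are cosines.
    return l2_normalize(relu(v))

def sparsity(v):
    # Fraction of exactly-zero coordinates.
    return sum(1 for x in v if x == 0.0) / len(v)

def cross_entropy_diag(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def sparse_clip_loss(image_feats, text_feats, temperature=0.07):
    # Symmetric InfoNCE over a batch of paired (image, text) features.
    img = [sparse_embed(v) for v in image_feats]
    txt = [sparse_embed(v) for v in text_feats]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txt]
              for i in img]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch of two pairs (made-up numbers, for illustration only).
image_feats = [[1.0, -2.0, 0.5, -0.1], [-1.0, 3.0, -0.5, 0.2]]
text_feats = [[0.9, -1.0, 0.4, -0.3], [-0.7, 2.5, -0.2, 0.1]]

matched_loss = sparse_clip_loss(image_feats, text_feats)
swapped_loss = sparse_clip_loss(image_feats, text_feats[::-1])
```

In this toy batch the matched pairing yields a much lower loss than the swapped pairing, and half of each embedding's coordinates are exactly zero after the ReLU. Because both encoders pass through the same nonnegative space in this sketch, an active coordinate can be shared by an image and its caption, which is the kind of multimodal feature the abstract says post-hoc SAEs tend to lose.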
Peer Reviews
Decision: ICLR 2026 Poster
1. By incorporating sparsity into training, the paper achieves both interpretability and high performance, offering a compelling design. 2. Comprehensive experiments demonstrate that Sparse CLIP performs comparably to dense CLIP on zero-shot classification tasks and surpasses post-hoc Sparse Autoencoders (SAEs) in interpretability. 3. Applications in Vision-Language Models: This paper presents a practical demonstration of Sparse CLIP’s utility and relevance for downstream tasks. In summary, th
1. While the findings are interesting, the proposed method is built entirely on existing work, making it relatively simple and lacking in significant novelty. 2. Unclear generalizability: While Sparse CLIP performs well on zero-shot classification benchmarks, it is unclear how well the method generalizes, as other CLIP models do, to downstream tasks such as object detection, segmentation, or open-vocabulary retrieval
This paper's strengths lie in the novelty and simplicity of the method: 1. The idea to train a CLIP with an inherently sparse representation is very interesting and can lead to significant progress in research on interpretability. It is simple and could be easily translated to other bi-modal encoders trained with the contrastive loss. 2. I appreciate the paper's flow, especially the 'lessons learned' given in Section 2.2, articulating the research process progressing into Section 2.3.
Overall, this work is in an early stage, requiring revision and extension to be considered for publication: 1. Regarding soundness, the experiments are limited (see questions below). Contribution 3 (implementing interpretable vision-based steering in vision-language models) executed in Section 4 is overstated significantly: the two examples shown in Table 2 cannot reliably demonstrate real-world applicability. 2. The paper requires proofreading (see feedback below; I just stopped listing at some
1. This paper proposes the SPARSE CLIP method, which effectively combines the strengths of both CLIP and SAE models, ensuring interpretability and performance in contrastive learning. 2. The paper provides a solid analysis of the interpretability achieved by combining large models with SPARSE CLIP. The cases presented in the paper demonstrate the strong interpretability of SPARSE CLIP as well as its powerful application potential.
The proposed SPARSE CLIP method exhibits strong interpretability, but its performance drops on many benchmarks compared to the original ViT-based CLIP baseline. I hypothesize that this is due to CLIP’s reliance on maintaining rich and continuous directional information in the embedding space. The discontinuity introduced by ReLU indeed produces highly sparse and interpretable features, but it also collapses many feature dimensions, resulting in sparse vectors that weaken cross-modal alignment an
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
