Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning

Chuan Qin; Constantin Venhoff; Sonia Joseph; Fanyi Xiao; Stefan Scherer

arXiv:2601.20075·cs.CV·January 29, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Sparse CLIP, a training method that creates sparse, interpretable, and high-performing vision-language representations, challenging the idea that interpretability must compromise accuracy.

Contribution

The authors propose integrating sparsity directly into CLIP training to enhance interpretability without sacrificing downstream performance or multimodal capabilities.

Findings

01. Sparse CLIP maintains strong downstream task performance.

02. Sparse representations improve interpretability and concept alignment.

03. Multimodal capabilities are preserved in sparse representations.

Abstract

Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in vision-language representation learning, powering diverse downstream tasks and serving as the default vision backbone in multimodal large language models (MLLMs). Despite its success, CLIP's dense and opaque latent representations pose significant interpretability challenges. A common assumption is that interpretability and performance are in tension: enforcing sparsity during training degrades accuracy, motivating recent post-hoc approaches such as Sparse Autoencoders (SAEs). However, these post-hoc approaches often suffer from degraded downstream performance and loss of CLIP's inherent multimodal capabilities, with most learned features remaining unimodal. We propose a simple yet effective approach that integrates sparsity directly into CLIP training, yielding representations that are both interpretable and…
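The abstract is cut off before it describes the mechanism, so the following is only a minimal sketch of what "integrating sparsity directly into CLIP training" could look like, not the authors' implementation. It assumes a ReLU applied to the projected embeddings (Reviewer 03 mentions ReLU as the source of sparsity) followed by the standard symmetric contrastive loss. The function names `sparse_embed`, `clip_loss`, and `sparsity`, the random projection weights, the embedding width of 256, and the temperature of 0.07 are all illustrative assumptions.

```python
import numpy as np

def sparse_embed(x, w):
    """Project features, then apply ReLU so many coordinates are exactly zero.

    Assumption: ReLU on the projection output is the sparsity mechanism.
    """
    z = np.maximum(x @ w, 0.0)                      # non-negative activations
    norm = np.linalg.norm(z, axis=1, keepdims=True)
    return z / np.maximum(norm, 1e-8)               # unit-normalize each row

def clip_loss(img, txt, temperature=0.07):
    """Symmetric InfoNCE loss; row i of img is paired with row i of txt."""
    logits = img @ txt.T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)        # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

def sparsity(z):
    """Fraction of coordinates that are exactly zero."""
    return float(np.mean(z == 0.0))

# Toy batch: random stand-ins for encoder outputs and projection weights.
rng = np.random.default_rng(0)
img_feats = rng.normal(size=(8, 32))
txt_feats = rng.normal(size=(8, 32))
w_img = rng.normal(size=(32, 256))
w_txt = rng.normal(size=(32, 256))

img = sparse_embed(img_feats, w_img)
txt = sparse_embed(txt_feats, w_txt)
loss = clip_loss(img, txt)
print(f"loss={loss:.3f}, image sparsity={sparsity(img):.2f}")
```

In this toy setup roughly half of the coordinates are zero simply because the random pre-activations are symmetric around zero; the sparsity level a trained model reaches, and any additional regularization the paper may use, are not stated in the excerpt above.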

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. By incorporating sparsity into training, the paper achieves both interpretability and high performance, offering a compelling design. 2. Comprehensive experiments demonstrate that Sparse CLIP performs comparably to dense CLIP on zero-shot classification tasks and surpasses post-hoc Sparse Autoencoders (SAEs) in interpretability. 3. Applications in Vision-Language Models: This paper presents a practical demonstration of Sparse CLIP’s utility and relevance for downstream tasks. In summary, th

Weaknesses

1. While the findings are interesting, the proposed method builds entirely on existing work, making it relatively simple and lacking in significant novelty. 2. Unclear generalizability: While Sparse CLIP performs well on zero-shot classification benchmarks, it is unclear how well the method generalizes to the downstream tasks that other CLIP models support, such as object detection, segmentation, or open-vocabulary retrieval

Reviewer 02 · Rating 4 · Confidence 4

Strengths

This paper's strengths lie in the novelty and simplicity of the method: 1. The idea to train a CLIP with an inherently sparse representation is very interesting and can lead to significant progress in research on interpretability. It is simple and could be easily translated to other bi-modal encoders trained with the contrastive loss. 2. I appreciate the paper's flow, especially the 'lessons learned' given in Section 2.2, articulating the research process progressing into Section 2.3.

Weaknesses

Overall, this work is in an early stage, requiring revision and extension to be considered for publication: 1. Regarding soundness, the experiments are limited (see questions below). Contribution 3 (implementing interpretable vision-based steering in vision-language models) executed in Section 4 is overstated significantly: the two examples shown in Table 2 cannot reliably demonstrate real-world applicability. 2. The paper requires proofreading (see feedback below; I just stopped listing at some

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. This paper proposes the SPARSE CLIP method, which effectively combines the strengths of both CLIP and SAE models, ensuring interpretability and performance in contrastive learning. 2. The paper provides a solid analysis of the interpretability achieved by combining large models with SPARSE CLIP. The cases presented in the paper demonstrate the strong interpretability of SPARSE CLIP as well as its powerful application potential.

Weaknesses

The proposed SPARSE CLIP method exhibits strong interpretability, but its performance drops on many benchmarks compared to the original ViT-based CLIP baseline. I hypothesize that this is due to CLIP’s reliance on maintaining rich and continuous directional information in the embedding space. The discontinuity introduced by ReLU indeed produces highly sparse and interpretable features, but it also collapses many feature dimensions, resulting in sparse vectors that weaken cross-modal alignment an

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)