From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance

Maximilian Dreyer; Lorenz Hufe; Jim Berend; Thomas Wiegand; Sebastian Lapuschkin; Wojciech Samek

arXiv:2505.20229 · cs.LG · May 27, 2025

TL;DR

This paper introduces a scalable framework for interpreting CLIP's latent components: what they encode, how well they align with expected semantics, and how much they drive predictions. It exposes reliance on semantically unexpected and spurious concepts.

Contribution

The authors adapt attribution patching for CLIP, uncovering how latent components influence predictions and identifying unexpected semantic dependencies and artifacts.

Findings

1. Uncovered hundreds of surprising components linked to polysemous words and artifacts.
2. Text embeddings are more robust to spurious correlations than image embeddings.
3. Case study shows classifiers can amplify hidden shortcuts in skin lesion detection.

Abstract

Transformer-based CLIP models are widely used for text-image probing and feature extraction, making it relevant to understand the internal mechanisms behind their predictions. While recent works show that Sparse Autoencoders (SAEs) yield interpretable latent components, they focus on what these encode and miss how they drive predictions. We introduce a scalable framework that reveals what latent components activate for, how they align with expected semantics, and how important they are to predictions. To achieve this, we adapt attribution patching for instance-wise component attributions in CLIP and highlight key faithfulness limitations of the widely used Logit Lens technique. By combining attributions with semantic alignment scores, we can automatically uncover reliance on components that encode semantically unexpected or spurious concepts. Applied across multiple CLIP variants, our…
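To make the idea of instance-wise component attribution concrete, here is a minimal toy sketch, not the authors' implementation. It assumes attribution patching in its usual first-order form: the effect of replacing a component's activation with a baseline is estimated as the activation difference times the gradient of the output. The "model" is a hypothetical linear score (dot product between a decoded embedding and a text embedding), so the estimate is exact here; in a real, nonlinear CLIP model it would only be an approximation. All names, vectors, alignment scores, and thresholds below are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def decode(z, directions):
    """Rebuild an embedding as a weighted sum of component directions."""
    dim = len(directions[0])
    return [sum(z_i * d[k] for z_i, d in zip(z, directions)) for k in range(dim)]

def score(z, directions, text_emb):
    """Toy stand-in for an image-text score: a plain dot product (assumption)."""
    return dot(decode(z, directions), text_emb)

def numeric_grad(f, z, eps=1e-4):
    """Central finite-difference gradient of f with respect to each latent."""
    grads = []
    for i in range(len(z)):
        up = list(z)
        up[i] += eps
        down = list(z)
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

def attribution_patching(f, z, z_base):
    """First-order estimate of each component's effect: (z - baseline) * gradient."""
    grads = numeric_grad(f, z)
    return [(z_i - b_i) * g_i for z_i, b_i, g_i in zip(z, z_base, grads)]

def flag_unexpected(attributions, alignment, min_attr, max_align):
    """Indices of components that matter to the output but align poorly
    with the expected semantics (hypothetical thresholds)."""
    return [i for i, (a, s) in enumerate(zip(attributions, alignment))
            if a >= min_attr and s <= max_align]

# Hypothetical toy data: three latent components with one-hot directions.
directions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_emb = [0.5, 0.0, 1.0]
z = [2.0, 1.0, 0.5]           # component activations for one input
z_base = [0.0, 0.0, 0.0]      # baseline: every component switched off
alignment = [0.9, 0.8, 0.1]   # made-up semantic alignment scores

f = lambda latents: score(latents, directions, text_emb)
attributions = attribution_patching(f, z, z_base)
flagged = flag_unexpected(attributions, alignment, min_attr=0.3, max_align=0.2)
print(attributions, flagged)
```

In this toy setup the attributions sum to the difference between the score on the input and the score on the baseline, which is a convenient sanity check. The flagged list holds the components with high attribution and low alignment, mirroring the idea of combining attributions with semantic alignment scores.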

Code & Models

Repositories

maxdreyer/attributing-clip (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Contrastive Language-Image Pre-training · Focus · ALIGN · Activation Patching