From What to How: Attributing CLIP's Latent Components Reveals Unexpected Semantic Reliance
Maximilian Dreyer, Lorenz Hufe, Jim Berend, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek

TL;DR
This paper introduces a scalable framework for interpreting CLIP's latent components. It shows what each component encodes and how strongly it drives predictions, exposing reliance on semantically unexpected concepts and spurious correlations.
Contribution
The authors adapt attribution patching to compute instance-wise component attributions in CLIP, and combine these attributions with semantic alignment scores to identify reliance on unexpected or spurious concepts.
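As a rough illustration of the idea behind attribution patching, the sketch below compares a first-order (gradient-based) estimate of each component's effect with the exact effect obtained by patching one component at a time. The toy score function, the `decoder` matrix, the zero baseline, and all numeric values are assumptions made for illustration; they are not the authors' model or implementation.

```python
import math

def forward(acts, decoder, text_emb):
    """Toy score (assumption): text_emb . tanh(decoder @ acts)."""
    pre = [sum(w * a for w, a in zip(row, acts)) for row in decoder]
    return sum(t * math.tanh(z) for t, z in zip(text_emb, pre))

def grad_wrt_acts(acts, decoder, text_emb):
    """Analytic gradient of the toy score w.r.t. each latent activation."""
    pre = [sum(w * a for w, a in zip(row, acts)) for row in decoder]
    upstream = [t * (1.0 - math.tanh(z) ** 2) for t, z in zip(text_emb, pre)]
    return [sum(u * row[j] for u, row in zip(upstream, decoder))
            for j in range(len(acts))]

def attribution_patching(acts, baseline, decoder, text_emb):
    """First-order estimate of the score drop when each component is
    replaced by its baseline value: (a_j - b_j) * d(score)/d(a_j)."""
    grads = grad_wrt_acts(acts, decoder, text_emb)
    return [(a - b) * g for a, b, g in zip(acts, baseline, grads)]

def activation_patching(acts, baseline, decoder, text_emb):
    """Exact score drop, computed with one extra forward pass per component."""
    full = forward(acts, decoder, text_emb)
    effects = []
    for j in range(len(acts)):
        patched = list(acts)
        patched[j] = baseline[j]
        effects.append(full - forward(patched, decoder, text_emb))
    return effects

# Hypothetical toy values: 3 latent components, 2-dimensional embedding.
decoder = [[0.5, -0.2, 0.1],
           [0.3, 0.4, -0.6]]
text_emb = [1.0, -0.5]
acts = [0.2, 0.1, 0.05]
baseline = [0.0, 0.0, 0.0]

approx = attribution_patching(acts, baseline, decoder, text_emb)
exact = activation_patching(acts, baseline, decoder, text_emb)
```

The gradient-based estimate needs a single gradient computation for all components, whereas exact patching needs one forward pass per component. In this toy setting the two agree closely because the activations are small; for a real model the quality of the approximation would have to be checked empirically.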
Findings
The framework uncovers hundreds of surprising components linked to polysemous words and artifacts.
Text embeddings are more robust to spurious correlations than image embeddings.
A case study shows that classifiers can amplify hidden shortcuts in skin lesion detection.
Abstract
Transformer-based CLIP models are widely used for text-image probing and feature extraction, making it relevant to understand the internal mechanisms behind their predictions. While recent works show that Sparse Autoencoders (SAEs) yield interpretable latent components, they focus on what these encode and miss how they drive predictions. We introduce a scalable framework that reveals what latent components activate for, how they align with expected semantics, and how important they are to predictions. To achieve this, we adapt attribution patching for instance-wise component attributions in CLIP and highlight key faithfulness limitations of the widely used Logit Lens technique. By combining attributions with semantic alignment scores, we can automatically uncover reliance on components that encode semantically unexpected or spurious concepts. Applied across multiple CLIP variants, our…
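To make the combination of attributions and semantic alignment more concrete, here is a minimal sketch that flags components with high attribution but low alignment to the expected concept. The abstract does not specify how the two scores are combined, so the cosine-based alignment score, the thresholds `min_attr` and `max_align`, and the example vectors are all hypothetical choices for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def flag_unexpected(attributions, directions, expected_text_emb,
                    min_attr=0.1, max_align=0.2):
    """Return (index, attribution, alignment) for components that matter to
    the prediction but point away from the expected text embedding.
    Thresholds are illustrative assumptions, not values from the paper."""
    flagged = []
    for idx, (attr, direction) in enumerate(zip(attributions, directions)):
        align = cosine(direction, expected_text_emb)
        if attr >= min_attr and align <= max_align:
            flagged.append((idx, attr, align))
    # Most influential components first.
    return sorted(flagged, key=lambda item: -item[1])

# Hypothetical example: three components in a 2-dimensional embedding space.
attributions = [0.5, 0.3, 0.01]
directions = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
expected_text_emb = [1.0, 0.1]

flagged = flag_unexpected(attributions, directions, expected_text_emb)
```

Here component 0 is important but well aligned, component 2 is poorly aligned but unimportant, and only component 1 is both important and poorly aligned, so it is the one flagged for inspection.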
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training · Focus · ALIGN · Activation Patching
