Explaining Caption-Image Interactions in CLIP Models with Second-Order Attributions

Lucas Möller; Pascal Tilli; Ngoc Thang Vu; Sebastian Padó

arXiv:2408.14153·cs.CV·August 14, 2025

Open Access

TL;DR

This paper introduces a second-order attribution method to explain how CLIP models compare image and caption features, revealing detailed cross-modal correspondences and their limitations.

Contribution

It develops a novel second-order attribution technique for dual encoders and applies it to CLIP, uncovering fine-grained visual-linguistic interactions and their variability.

Findings

01. CLIP models learn detailed correspondences between image regions and caption parts

02. The visual-linguistic grounding varies significantly across object classes

03. Systematic errors and out-of-domain effects are identified in CLIP's interactions

Abstract

Dual encoder architectures like CLIP models map two types of inputs into a shared embedding space and predict similarities between them. Despite their wide application, however, it is not understood how these models compare their two inputs. Common first-order feature-attribution methods explain the importance of individual features and can thus provide only limited insight into dual encoders, whose predictions depend on interactions between features. In this paper, we first derive a second-order method that attributes the predictions of any differentiable dual encoder to feature interactions between its inputs. Second, we apply our method to CLIP models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. This intrinsic visual-linguistic grounding ability,…
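The abstract does not spell out the derivation, so the following is only a minimal sketch of the general idea behind a second-order attribution, not the authors' method. It assumes a toy dual encoder with two hypothetical tanh-linear encoders (weights `W` and `V` are made up for illustration) and a dot-product similarity s(a, b) = f(a) · g(b). Under that assumption, the mixed partial derivative of s with respect to input feature a_i and input feature b_j equals the sum over embedding dimensions of the product of the two encoder Jacobians, which yields one score per feature pair rather than one per feature.

```python
import math

def encode(weights, x):
    """Toy encoder (an assumption): tanh of a linear map of the input."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def jacobian(weights, x):
    """Jacobian of the toy encoder: J[k][i] = d embedding_k / d x_i."""
    emb = encode(weights, x)
    return [[(1.0 - e * e) * w for w in row] for e, row in zip(emb, weights)]

def similarity(W, V, a, b):
    """Dot-product similarity between the two embeddings."""
    return sum(p * q for p, q in zip(encode(W, a), encode(V, b)))

def interaction_matrix(W, V, a, b):
    """Mixed partials d^2 s / (d a_i d b_j) = sum_k Jf[k][i] * Jg[k][j]."""
    Jf, Jg = jacobian(W, a), jacobian(V, b)
    return [[sum(Jf[k][i] * Jg[k][j] for k in range(len(Jf)))
             for j in range(len(b))] for i in range(len(a))]

def mixed_fd(W, V, a, b, i, j, h=1e-4):
    """Central finite-difference estimate of the same mixed partial."""
    def shifted(x, idx, d):
        y = list(x)
        y[idx] += d
        return y
    return (similarity(W, V, shifted(a, i, h), shifted(b, j, h))
            - similarity(W, V, shifted(a, i, h), shifted(b, j, -h))
            - similarity(W, V, shifted(a, i, -h), shifted(b, j, h))
            + similarity(W, V, shifted(a, i, -h), shifted(b, j, -h))) / (4 * h * h)

# Hypothetical toy weights and inputs, chosen only for illustration.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]   # encoder for input a (3 features)
V = [[0.4, -0.7], [-0.1, 0.6]]             # encoder for input b (2 features)
a = [1.0, 0.5, -1.0]
b = [0.2, -0.3]

A = interaction_matrix(W, V, a, b)         # 3 x 2 matrix of pairwise scores
for row in A:
    print(["%.4f" % v for v in row])
```

Each entry `A[i][j]` scores how feature `i` of one input and feature `j` of the other jointly affect the similarity, which a first-order gradient over a single input cannot express. The `mixed_fd` helper is included only as a numerical sanity check of the analytic expression.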

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training