Explaining Caption-Image Interactions in CLIP Models with Second-Order Attributions
Lucas Möller, Pascal Tilli, Ngoc Thang Vu, Sebastian Padó

TL;DR
This paper introduces a second-order attribution method to explain how CLIP models compare image and caption features, revealing detailed cross-modal correspondences and their limitations.
Contribution
It develops a novel second-order attribution technique for dual encoders and applies it to CLIP, uncovering fine-grained visual-linguistic interactions and their variability.
Findings
CLIP models learn detailed correspondences between image regions and caption parts
The visual-linguistic grounding varies significantly across object classes
Systematic errors and out-of-domain effects are identified in CLIP's interactions
Abstract
Dual encoder architectures like CLIP models map two types of inputs into a shared embedding space and predict similarities between them. Despite their wide application, it is not well understood how these models compare their two inputs. Common first-order feature-attribution methods explain the importance of individual features and can thus provide only limited insight into dual encoders, whose predictions depend on interactions between features. In this paper, we first derive a second-order method that attributes the predictions of any differentiable dual encoder to feature interactions between its inputs. Second, we apply our method to CLIP models and show that they learn fine-grained correspondences between parts of captions and regions in images. They match objects across input modes and also account for mismatches. This intrinsic visual-linguistic grounding ability,…
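To make the idea of attributing a similarity score to feature interactions concrete, here is a minimal sketch. It is not the paper's method, whose details are not given in the excerpt above. It assumes a toy dual encoder with one tanh layer per side and a dot-product score, estimates the mixed second derivatives of the score with finite differences, and scales them by the inputs. All names (`encode`, `make_score`, `mixed_partials`, `interaction_attributions`) and all weights are hypothetical.

```python
import math


def encode(weights, x, activation=math.tanh):
    """Toy encoder: one dense layer (hypothetical stand-in for one CLIP tower)."""
    return [activation(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def make_score(W_a, W_b, activation=math.tanh):
    """Build a dual-encoder score: dot product of the two embeddings."""
    def score(a, b):
        ea = encode(W_a, a, activation)
        eb = encode(W_b, b, activation)
        return sum(p * q for p, q in zip(ea, eb))
    return score


def mixed_partials(score, a, b, eps=1e-3):
    """Central finite-difference estimate of d^2 score / (d a_i d b_j)."""
    H = [[0.0] * len(b) for _ in a]
    for i in range(len(a)):
        for j in range(len(b)):
            def shifted(da, db):
                a2, b2 = list(a), list(b)
                a2[i] += da
                b2[j] += db
                return score(a2, b2)
            H[i][j] = (shifted(eps, eps) - shifted(eps, -eps)
                       - shifted(-eps, eps) + shifted(-eps, -eps)) / (4 * eps * eps)
    return H


def interaction_attributions(score, a, b):
    """Assumed attribution rule: input_i * mixed partial * input_j."""
    H = mixed_partials(score, a, b)
    return [[a[i] * H[i][j] * b[j] for j in range(len(b))] for i in range(len(a))]


# Toy inputs: three "caption" features and two "image" features (made up).
W_a = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]
W_b = [[0.7, -0.1], [-0.6, 0.9]]
a = [1.0, 0.5, -1.5]
b = [0.8, -0.3]

score = make_score(W_a, W_b)
A = interaction_attributions(score, a, b)  # 3 x 2 matrix, one entry per feature pair
```

Each entry `A[i][j]` scores one caption-feature and image-feature pair, where a first-order method would give only one value per feature. If the tanh is replaced by the identity, the score is bilinear and the entries of `A` sum exactly to the score. For nonlinear encoders this local rule has no such guarantee.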
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
