Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP
Jonas Herzog, Yue Wang

TL;DR
This paper critically reexamines the intra-modal misalignment hypothesis for CLIP. It argues that the supposed misalignment does not exist and that task ambiguity, rather than poorly calibrated image-image distances, explains the performance issues on image-only tasks.
Contribution
The study refutes the intra-modal misalignment hypothesis for CLIP. It demonstrates that intra-modal distances are not inherently misaligned and that task ambiguity matters more for image-only performance than intra-modal alignment does.
Findings
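Empirical measures of intra-modal misalignment yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2).
Theoretical analysis shows that the supposed degrees of freedom for image embedding distances do not exist.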
Addressing task ambiguity improves intra-modal task performance.
Abstract
Recent research suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics affected. For the theoretical argument, we demonstrate that there are no such supposed degrees of freedom for image embedding distances. For the empirical measures, our findings reveal they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2). This indicates the observed phenomena do not stem from a misalignment specific to the former.…
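To make the inter-modal versus intra-modal distinction concrete, here is a minimal sketch that compares image-image similarities with image-text similarities over a set of embeddings. Everything in it is an assumption made for illustration: the vectors are random stand-ins rather than outputs of CLIP, SigLIP, or DINO, the helper names `cosine` and `mean_pairwise_cosine` are hypothetical, and mean pairwise cosine similarity is only one plausible statistic, not necessarily an indicator the paper evaluates.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(set_a, set_b=None):
    """Mean cosine similarity over pairs of embeddings.

    With one set: all distinct pairs within it (intra-modal).
    With two sets: all cross pairs between them (inter-modal).
    """
    if set_b is None:
        pairs = [
            (set_a[i], set_a[j])
            for i in range(len(set_a))
            for j in range(i + 1, len(set_a))
        ]
    else:
        pairs = [(a, b) for a in set_a for b in set_b]
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Random stand-in embeddings (assumption: real ones would come from a model).
rng = random.Random(0)
dim = 16
image_embs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(8)]
text_embs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(8)]

intra_image = mean_pairwise_cosine(image_embs)            # image-image
inter_modal = mean_pairwise_cosine(image_embs, text_embs)  # image-text

print(f"mean intra-modal (image-image) cosine: {intra_image:.3f}")
print(f"mean inter-modal (image-text) cosine:  {inter_modal:.3f}")
```

Running the same statistic on embeddings from a language-image model and from an image-image model would be one way to check whether a given indicator actually separates the two training paradigms, which is the kind of comparison the abstract describes.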
Taxonomy
Topics: Multimodal Machine Learning Applications · Neurobiology of Language and Bilingualism · Domain Adaptation and Few-Shot Learning
