Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
Marco Mistretta, Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, Andrew D. Bagdanov

TL;DR
This paper reveals intra-modal misalignment in CLIP caused by its inter-modal contrastive loss and proposes modality inversion techniques to improve intra-modal task performance, highlighting the importance of intra-modal constraints.
Contribution
It uses optimization-based modality inversion to expose intra-modal misalignment in CLIP and shows that approaching intra-modal tasks inter-modally improves their performance.
Findings
Intra-modal retrieval performance improves with inter-modal approaches.
Natively inter-modal tasks are negatively impacted when approached intra-modally.
Incorporating intra-modal constraints reduces intra-modal misalignment.
Abstract
Pre-trained multi-modal Vision-Language Models like CLIP are widely used off-the-shelf for a variety of applications. In this paper, we show that the common practice of individually exploiting the text or image encoders of these powerful multi-modal models is highly suboptimal for intra-modal tasks like image-to-image retrieval. We argue that this is inherently due to the CLIP-style inter-modal contrastive loss that does not enforce any intra-modal constraints, leading to what we call intra-modal misalignment. To demonstrate this, we leverage two optimization-based modality inversion techniques that map representations from their input modality to the complementary one without any need for auxiliary data or additional trained adapters. We empirically show that, in the intra-modal tasks of image-to-image and text-to-text retrieval, approaching these tasks inter-modally significantly…
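The idea of optimization-based modality inversion mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the frozen encoder below is a hypothetical stand-in (a random matrix followed by tanh and normalization) rather than a real CLIP encoder, and the names `encode`, `cosine_and_grad`, and `invert` are assumptions made for illustration. The sketch shows only the general pattern: keep the encoder fixed and run gradient ascent on its input so that the encoder output aligns, in cosine similarity, with a target feature that would come from the other modality.

```python
import numpy as np

def encode(W, x):
    """Toy frozen encoder (assumption): unit-normalized tanh(W @ x)."""
    h = np.tanh(W @ x)
    return h / np.linalg.norm(h)

def cosine_and_grad(W, x, target):
    """Cosine similarity between encode(W, x) and a unit-norm target,
    together with its analytic gradient with respect to x."""
    h = np.tanh(W @ x)
    norm = np.linalg.norm(h)
    u = h / norm
    cos = float(u @ target)
    # d cos / d h for cos = (h . target) / ||h||
    dcos_dh = (target - cos * u) / norm
    # chain rule through tanh and the linear map W
    grad_x = W.T @ ((1.0 - h ** 2) * dcos_dh)
    return cos, grad_x

def invert(W, target, steps=500, lr=0.1, seed=0):
    """Gradient ascent on the encoder input; the encoder weights W stay fixed.
    Returns the optimized input and the cosine similarity at every step."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=W.shape[1])  # random initial input ("pseudo-token")
    history = []
    for _ in range(steps):
        cos, grad = cosine_and_grad(W, x, target)
        history.append(cos)
        x = x + lr * grad  # ascent step: increase cosine similarity
    history.append(cosine_and_grad(W, x, target)[0])
    return x, history

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.normal(size=(16, 16)) / 4.0       # frozen toy encoder weights
    target = encode(W, rng.normal(size=16))   # stand-in for a feature from the other modality
    x, history = invert(W, target)
    print(f"cosine similarity: start={history[0]:.3f}, end={history[-1]:.3f}")
```

In a real setting the toy encoder would be replaced by a frozen CLIP text or image encoder and the gradient would come from automatic differentiation; only the input being optimized changes, which is why no auxiliary data or trained adapter is needed.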
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training
