It's Not a Modality Gap: Characterizing and Addressing the Contrastive Gap
Abrar Fahim, Alex Murphy, Alona Fyshe

TL;DR
This paper investigates the inherent contrastive gap in multi-modal contrastive models like CLIP, attributes it to low uniformity in embedding space, and proposes modifications to improve embedding distribution and downstream task performance.
Contribution
The paper identifies the contrastive gap as an inherent issue in two-encoder contrastive models and introduces a method to close this gap by enhancing uniformity and alignment in the embedding space.
Findings
Closing the contrastive gap improves zero-shot classification accuracy.
Adding uniformity and alignment terms leads to more evenly distributed embeddings.
Modified models outperform default CLIP in downstream tasks.
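The findings above refer to uniformity and alignment without defining them. As a rough illustration, the sketch below uses one formulation that is common in the contrastive learning literature: alignment as the mean squared distance between paired embeddings, and uniformity as the log of the mean Gaussian potential over all distinct pairs. The function names, the NumPy implementation, and the default t = 2 are assumptions made for this example; the summary does not state the exact terms the authors add to the CLIP loss.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit hypersphere."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def alignment(img, txt):
    """Mean squared distance between paired embeddings (lower = better aligned).

    Assumes row i of `img` and row i of `txt` form a matched pair.
    """
    return float(np.mean(np.sum((img - txt) ** 2, axis=1)))

def uniformity(emb, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (lower = more uniform).

    `t` is a hypothetical default chosen for illustration.
    """
    sq_dists = np.sum((emb[:, None, :] - emb[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(emb.shape[0], k=1)  # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))
```

Under this formulation, fully collapsed embeddings give a uniformity of 0, and the value becomes more negative as the points spread over the hypersphere.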
Abstract
Multi-modal contrastive models such as CLIP achieve state-of-the-art performance in zero-shot classification by embedding input images and texts in a joint representational space. Recently, a modality gap has been reported in two-encoder contrastive models like CLIP, meaning that the image and text embeddings reside in disjoint areas of the latent space. Previous studies suggest that this gap exists due to 1) the cone effect, 2) mismatched pairs in the dataset, and 3) insufficient training. We show that, even when accounting for all these factors, and even when using the same modality, the contrastive loss actually creates a gap during training. As a result, we propose that the modality gap is inherent to the two-encoder contrastive loss and rename it the contrastive gap. We present evidence that attributes this contrastive gap to low uniformity in CLIP space, resulting in embeddings…
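To make the two-encoder contrastive loss and the notion of a gap more concrete, here is a minimal NumPy sketch. It assumes the standard symmetric cross-entropy form used by CLIP-style models, with matched pairs on the diagonal of the similarity matrix, and it measures the gap as the distance between the two modality centroids. The centroid distance is only one simple proxy chosen for illustration and is not necessarily the measure used in the paper; the function names and the temperature value of 0.07 are likewise assumptions.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def clip_style_loss(img, txt, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row i of `img` and row i of `txt` are treated as the positive pair;
    every other row in the batch acts as a negative.
    """
    logits = img @ txt.T / temperature
    positives = np.diag(logits)
    loss_img_to_txt = np.mean(logsumexp(logits, axis=1) - positives)
    loss_txt_to_img = np.mean(logsumexp(logits, axis=0) - positives)
    return float(0.5 * (loss_img_to_txt + loss_txt_to_img))

def centroid_gap(img, txt):
    """Euclidean distance between the mean image and mean text embedding."""
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))
```

With embeddings from the two encoders stacked as rows, tracking centroid_gap alongside clip_style_loss during training is one way to check whether the loss decreases while the two sets of embeddings remain separated.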
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Multimodal Machine Learning Applications
Methods: Contrastive Language-Image Pre-training
