It's Not a Modality Gap: Characterizing and Addressing the Contrastive Gap

Abrar Fahim; Alex Murphy; Alona Fyshe

arXiv:2405.18570·cs.CV·June 10, 2024

Open Access

TL;DR

This paper investigates the inherent contrastive gap in multi-modal contrastive models like CLIP, attributes it to low uniformity in embedding space, and proposes modifications to improve embedding distribution and downstream task performance.

Contribution

The paper identifies the contrastive gap as an inherent issue in two-encoder contrastive models and introduces a method to close this gap by enhancing uniformity and alignment in the embedding space.

Findings

1. Closing the contrastive gap improves zero-shot classification accuracy.

2. Adding uniformity and alignment terms leads to more evenly distributed embeddings.

3. Modified models outperform default CLIP in downstream tasks.
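
To make the uniformity and alignment terms in the findings concrete, here is a minimal NumPy sketch. It assumes the commonly used hypersphere definitions (alignment as the mean squared distance between paired embeddings, uniformity as the log of the mean Gaussian potential over distinct pairs). The paper's exact loss terms are not reproduced on this page, so the function names, the parameter `t`, and the toy data are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit hypersphere."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def alignment(img, txt):
    """Mean squared distance between paired embeddings (lower = better aligned)."""
    return float(np.mean(np.sum((img - txt) ** 2, axis=1)))

def uniformity(emb, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (lower = more uniform).

    t=2.0 is an assumed default, not a value taken from the paper.
    """
    sq_dists = np.sum((emb[:, None, :] - emb[None, :, :]) ** 2, axis=-1)
    upper = np.triu_indices(emb.shape[0], k=1)  # each unordered pair once
    return float(np.log(np.mean(np.exp(-t * sq_dists[upper]))))

# Toy data (hypothetical): one set spread over the sphere, one packed in a narrow cone.
rng = np.random.default_rng(0)
spread = l2_normalize(rng.normal(size=(64, 16)))
narrow = l2_normalize(rng.normal(size=(64, 16)) * 0.05 + 1.0)

print("uniformity (spread):", uniformity(spread))
print("uniformity (narrow):", uniformity(narrow))
```

Under these definitions, embeddings packed into a narrow region score close to 0 on uniformity, while well-spread embeddings score clearly negative, which is the sense in which "low uniformity" describes clustered embeddings.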

Abstract

Multi-modal contrastive models such as CLIP achieve state-of-the-art performance in zero-shot classification by embedding input images and texts on a joint representational space. Recently, a modality gap has been reported in two-encoder contrastive models like CLIP, meaning that the image and text embeddings reside in disjoint areas of the latent space. Previous studies suggest that this gap exists due to 1) the cone effect, 2) mismatched pairs in the dataset, and 3) insufficient training. We show that, even when accounting for all these factors, and even when using the same modality, the contrastive loss actually creates a gap during training. As a result, we propose that the modality gap is inherent to the two-encoder contrastive loss and rename it the contrastive gap. We present evidence that attributes this contrastive gap to low uniformity in CLIP space, resulting in embeddings…
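
As a rough illustration of the two-encoder contrastive loss the abstract refers to, the sketch below computes a symmetric cross-entropy over a batch similarity matrix, together with one simple way to quantify a gap between two embedding sets (the distance between their centroids). Both functions are assumptions for illustration: the name `clip_style_loss`, the temperature value, and the centroid-distance measure are not taken from this page, and the paper may define its loss and gap metric differently. Inputs are assumed to be L2-normalized.

```python
import numpy as np

def log_softmax(z, axis):
    """Numerically stable log-softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_style_loss(img, txt, temperature=0.07):
    """Symmetric contrastive loss; row i of img is paired with row i of txt.

    temperature=0.07 is an illustrative value, not one reported on this page.
    """
    logits = img @ txt.T / temperature
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return float((loss_i2t + loss_t2i) / 2)

def centroid_gap(img, txt):
    """Euclidean distance between the mean image and mean text embedding."""
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

# Toy batch (hypothetical): four orthonormal "image" embeddings.
img = np.eye(4)
matched = np.eye(4)                   # every text matches its image
mismatched = np.eye(4)[[1, 0, 3, 2]]  # pairs swapped

print("loss, matched pairs:   ", clip_style_loss(img, matched))
print("loss, mismatched pairs:", clip_style_loss(img, mismatched))
print("centroid gap vs. negated set:", centroid_gap(img, -img))
```

The loss only rewards each pair for being more similar than the other pairs in the batch, so a low loss does not by itself require the two embedding sets to occupy the same region of the space.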

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training