Explaining and Mitigating the Modality Gap in Contrastive Multimodal Learning
Can Yaras, Siyi Chen, Peng Wang, Qing Qu

TL;DR
This paper analyzes why a modality gap arises in contrastive multimodal models such as CLIP, shows how mismatched data pairs and the learnable temperature parameter contribute to it, and proposes mitigation strategies that improve image-text retrieval performance.
Contribution
The paper provides a theoretical analysis of the modality gap in contrastive multimodal models and introduces mitigation strategies based on gradient flow insights.
Findings
Mismatched data pairs and a learnable temperature parameter both contribute to the modality gap.
Temperature scheduling and modality swapping can reduce the gap.
Reducing the gap improves image-text retrieval performance.
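
To make the quantities in these findings concrete, here is a minimal NumPy sketch of a CLIP-style symmetric contrastive loss with an explicit temperature, together with one common way to quantify a modality gap: the distance between the centroids of the normalized image and text embeddings. The function names (`clip_loss`, `centroid_gap`) are hypothetical, and the centroid distance is an assumption chosen for illustration; the paper may define and measure the gap differently.

```python
import numpy as np

def l2_normalize(x):
    # Project each row onto the unit sphere, as CLIP-style models do.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def logsumexp(a, axis):
    # Numerically stable log-sum-exp along one axis.
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def clip_loss(img, txt, temperature):
    # Symmetric InfoNCE loss: row i of img is paired with row i of txt.
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature       # pairwise cosine similarity / temperature
    diag = np.diag(logits)                   # similarities of the matched pairs
    loss_i2t = (logsumexp(logits, axis=1) - diag).mean()  # image -> text direction
    loss_t2i = (logsumexp(logits, axis=0) - diag).mean()  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

def centroid_gap(img, txt):
    # Assumed gap measure: Euclidean distance between modality centroids.
    img, txt = l2_normalize(img), l2_normalize(txt)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

# Toy usage with random, untrained embeddings (illustrative only).
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(8, 16))
txt_emb = rng.normal(size=(8, 16))
print(clip_loss(img_emb, txt_emb, temperature=0.07))
print(centroid_gap(img_emb, txt_emb))
```

In this sketch the temperature is a fixed argument. In a trained model it would typically be a learnable parameter, which is the setting the findings above refer to.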
Abstract
Multimodal learning has recently gained significant popularity, demonstrating impressive performance across various zero-shot classification tasks and a range of perceptive and generative applications. Models such as Contrastive Language-Image Pretraining (CLIP) are designed to bridge different modalities, such as images and text, by learning a shared representation space through contrastive learning. Despite their success, the working mechanisms underlying multimodal learning are not yet well understood. Notably, these models often exhibit a modality gap, where different modalities occupy distinct regions within the shared representation space. In this work, we conduct an in-depth analysis of the emergence of modality gap by characterizing the gradient flow learning dynamics. Specifically, we identify the critical roles of mismatched data pairs and a learnable temperature parameter in…
Taxonomy
Topics: EFL/ESL Teaching and Learning · Discourse Analysis in Language Studies · Innovative Teaching and Learning Methods
Methods: Contrastive Language-Image Pre-training
