Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning

Weixin Liang; Yuhui Zhang; Yongchan Kwon; Serena Yeung; James Zou

arXiv:2203.02053 · cs.CL · October 21, 2022 · 98 citations

TL;DR

This paper investigates the geometric modality gap in multi-modal models like CLIP, revealing its causes and effects on model performance and fairness, and providing insights for improving multi-modal contrastive learning.

Contribution

The study offers a systematic analysis of the modality gap, combining empirical and theoretical insights into its origins and impact on downstream tasks.

Findings

1. The modality gap is caused by a combination of model initialization and contrastive learning optimization.
2. Varying the size of the modality gap changes zero-shot classification performance.
3. Adjusting the modality gap can improve both fairness and zero-shot accuracy.
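The findings about varying the gap can be made concrete with a small sketch. It assumes the gap is measured as the difference between the centroids of the L2-normalized image and text embeddings, and that it is varied by shifting both modalities along that vector and re-normalizing. The helper names (`modality_gap`, `shift_embeddings`), the parameter `lam`, and the synthetic embeddings are illustrative assumptions, not code from the paper's repositories.

```python
import numpy as np

def l2_normalize(x):
    """Project each row onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def modality_gap(img, txt):
    """Gap vector: difference between the centroids of the two modalities."""
    return img.mean(axis=0) - txt.mean(axis=0)

def shift_embeddings(img, txt, lam):
    """Move both modalities along the gap vector, then re-normalize.
    With this sign convention, lam > 0 narrows the gap and lam < 0 widens it."""
    gap = modality_gap(img, txt)
    return l2_normalize(img - lam * gap), l2_normalize(txt + lam * gap)

rng = np.random.default_rng(0)
dim = 64
# Synthetic stand-ins for encoder outputs (an assumption, not real CLIP features):
# two noisy clusters centered on different coordinate directions.
img = l2_normalize(rng.normal(size=(200, dim)) * 0.1 + np.eye(dim)[0])
txt = l2_normalize(rng.normal(size=(200, dim)) * 0.1 + np.eye(dim)[1])

gap_before = np.linalg.norm(modality_gap(img, txt))
img_s, txt_s = shift_embeddings(img, txt, lam=0.25)
gap_after = np.linalg.norm(modality_gap(img_s, txt_s))
print(f"gap before shift: {gap_before:.3f}, after shift: {gap_after:.3f}")
```

With real embeddings, one would re-run zero-shot classification on the shifted embeddings for several values of `lam` and compare accuracy and fairness metrics across them.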

Abstract

We present modality gap, an intriguing geometric phenomenon of the representation space of multi-modal models. Specifically, we show that different data modalities (e.g. images and text) are embedded at arm's length in their shared representation in multi-modal models such as CLIP. Our systematic analysis demonstrates that this gap is caused by a combination of model initialization and contrastive learning optimization. In model initialization, we show empirically and theoretically that the representation of a common deep neural network is restricted to a narrow cone. As a consequence, in a multi-modal model with two encoders, the representations of the two modalities are clearly apart when the model is initialized. During optimization, contrastive learning keeps the different modalities separate by a certain distance, which is influenced by the temperature parameter in the loss…
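The initialization argument in the abstract can be illustrated with a toy experiment. The sketch below builds two randomly initialized ReLU MLPs as stand-ins for the two encoders and feeds them the same random inputs. The architecture, widths, He-style weight scaling, and function names are all assumptions chosen for illustration; the paper's own experiments use real encoders. Within one encoder the outputs should have a high average cosine similarity (a narrow cone), while outputs of the two differently initialized encoders point in largely unrelated directions, which yields a gap before any training.

```python
import numpy as np

def random_mlp(rng, in_dim, width, out_dim, n_hidden=3):
    """Randomly initialized weights (He-style scaling, no biases)."""
    dims = [in_dim] + [width] * n_hidden + [out_dim]
    return [rng.normal(scale=np.sqrt(2.0 / d_in), size=(d_in, d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def encode(weights, x):
    """ReLU hidden layers, linear output layer, then L2 normalization."""
    for w in weights[:-1]:
        x = np.maximum(x @ w, 0.0)
    x = x @ weights[-1]
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def mean_cosine(a, b):
    """Average cosine similarity over all pairs of rows of a and b.
    For a == b this includes self-pairs, which is negligible for many rows."""
    return float((a @ b.T).mean())

x = np.random.default_rng(0).normal(size=(500, 128))      # shared random inputs
enc_a = random_mlp(np.random.default_rng(1), 128, 256, 64)  # stand-in "image" encoder
enc_b = random_mlp(np.random.default_rng(2), 128, 256, 64)  # stand-in "text" encoder

emb_a, emb_b = encode(enc_a, x), encode(enc_b, x)
within_a = mean_cosine(emb_a, emb_a)   # high value: outputs sit in a narrow cone
within_b = mean_cosine(emb_b, emb_b)
cross = mean_cosine(emb_a, emb_b)      # the two cones point in different directions
gap = float(np.linalg.norm(emb_a.mean(axis=0) - emb_b.mean(axis=0)))
print(f"within A: {within_a:.2f}, within B: {within_b:.2f}, "
      f"cross: {cross:.2f}, centroid gap: {gap:.2f}")
```

The raw inputs have an average cosine similarity near zero, so any concentration seen in `within_a` and `within_b` comes from the untrained networks themselves rather than from the data.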

Code & Models

Videos

Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning (SlidesLive)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Learning · Contrastive Language-Image Pre-training