Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data
Ryumei Nakada, Halil Ibrahim Gulluk, Zhun Deng, Wenlong Ji, James Zou, Linjun Zhang

TL;DR
Working under linear representation settings, this paper studies a general class of nonlinear contrastive loss functions for multimodal learning, shows their connection to SVD, and demonstrates that incorporating unpaired data can improve model robustness and performance.
Contribution
It introduces a general nonlinear loss framework for MMCL, analyzes its connection to SVD, and proposes a method to leverage unpaired data for improved learning.
Findings
MMCL loss relates to SVD of cross-covariance matrices.
MMCL can outperform unimodal contrastive learning even with noisy pairs.
Incorporating unpaired data improves MMCL performance.
Abstract
Language-supervised vision models have recently attracted great attention in computer vision. A common approach to build such models is to use contrastive learning on paired data across the two modalities, as exemplified by Contrastive Language-Image Pre-Training (CLIP). In this paper, under linear representation settings, (i) we initiate the investigation of a general class of nonlinear loss functions for multimodal contrastive learning (MMCL) including CLIP loss and show its connection to singular value decomposition (SVD). Namely, we show that each step of loss minimization by gradient descent can be seen as performing SVD on a contrastive cross-covariance matrix. Based on this insight, (ii) we analyze the performance of MMCL. We quantitatively show that the feature learning ability of MMCL can be better than that of unimodal contrastive learning applied to each modality even under…
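To make the SVD connection concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define the contrastive cross-covariance matrix, so the sketch assumes it is the average of paired outer products minus the average of mismatched outer products. The function names, the synthetic data model, and all dimensions are hypothetical. The sketch builds that matrix from paired samples and checks that its top singular vectors align with the shared signal directions of the two modalities.

```python
import numpy as np

def contrastive_cross_covariance(X, Y):
    """Assumed form: mean of paired x_i y_i^T minus mean of mismatched x_i y_j^T (i != j).

    X has shape (n, d1), Y has shape (n, d2); row i of X is paired with row i of Y.
    """
    n = X.shape[0]
    paired_sum = X.T @ Y  # sum_i x_i y_i^T
    # sum over i != j of x_i y_j^T = (sum_i x_i)(sum_j y_j)^T - sum_i x_i y_i^T
    mismatched_sum = np.outer(X.sum(axis=0), Y.sum(axis=0)) - paired_sum
    return paired_sum / n - mismatched_sum / (n * (n - 1))

def top_singular_subspaces(S, r):
    """Return the top-r left singular vectors, singular values, and right singular vectors."""
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r].T

# Hypothetical synthetic data: both modalities share an r-dimensional latent signal.
rng = np.random.default_rng(0)
n, d1, d2, r = 2000, 20, 15, 3
U1 = np.linalg.qr(rng.standard_normal((d1, r)))[0]  # orthonormal signal basis, modality 1
U2 = np.linalg.qr(rng.standard_normal((d2, r)))[0]  # orthonormal signal basis, modality 2
Z = rng.standard_normal((n, r))                     # shared latent variables
X = Z @ U1.T + 0.5 * rng.standard_normal((n, d1))
Y = Z @ U2.T + 0.5 * rng.standard_normal((n, d2))

S = contrastive_cross_covariance(X, Y)
U_hat, s_hat, V_hat = top_singular_subspaces(S, r)

# Smallest cosine of the principal angles between true and estimated subspaces
# (values near 1 mean the subspaces nearly coincide).
align_x = np.linalg.svd(U1.T @ U_hat, compute_uv=False).min()
align_y = np.linalg.svd(U2.T @ V_hat, compute_uv=False).min()
print("top singular values:", s_hat)
print("subspace alignment:", align_x, align_y)
```

Under this assumed definition, the matrix coincides with the usual unbiased sample cross-covariance of the centered pairs, which is one way to read the name "contrastive cross-covariance": the mismatched pairs play the role of the mean-subtraction term.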
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Radiomics and Machine Learning in Medical Imaging
Methods: Contrastive Language-Image Pre-training · Contrastive Learning
