Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data

Yuhui Zhang; Elaine Sui; Serena Yeung-Levy

arXiv:2401.08567 · cs.LG · January 17, 2024 · 1 citation

TL;DR

This paper gives a theoretical account of the geometry of multi-modal contrastive representation spaces and introduces a three-step method, $C^{3}$ (Connect, Collapse, Corrupt), that bridges the modality gap. The method improves cross-modal learning from uni-modal data and achieves state-of-the-art zero-shot captioning results.

Contribution

It provides a theoretical analysis of the multi-modal contrastive space and proposes C^3, a method to bridge the modality gap, enhancing cross-modal task performance from uni-modal data.

Findings

1. Achieves state-of-the-art zero-shot captioning results.
2. Improves cross-modal learning from uni-modal data.
3. Validates the theoretical analysis with empirical results.

Abstract

Building cross-modal applications is challenging due to limited paired multi-modal data. Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data. This is based on the assumption that contrastive optimization makes embeddings from different modalities interchangeable. However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists. In our study, we provide a theoretical explanation of this space's geometry and introduce a three-step method, $C^{3}$ (Connect, Collapse, Corrupt), to bridge the modality gap, enhancing the interchangeability of embeddings. Our $C^{3}$ method significantly improves cross-modal learning from uni-modal data, achieving state-of-the-art results on zero-shot image / audio / video…
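The abstract names the three steps but does not spell out their mechanics, so the following is only a minimal NumPy sketch under stated assumptions: "Connect" is taken to mean embedding both modalities with a pre-trained contrastive encoder (replaced here by synthetic vectors), "Collapse" is taken to mean subtracting each modality's mean embedding, and "Corrupt" is taken to mean adding Gaussian noise to training embeddings. The function names `collapse` and `corrupt`, the noise scale, and the synthetic data are all hypothetical and are not taken from the official repository.

```python
import numpy as np

def collapse(embeddings, modality_mean):
    """Assumed 'Collapse' step: subtract a per-modality mean embedding."""
    return embeddings - modality_mean

def corrupt(embeddings, noise_std, rng):
    """Assumed 'Corrupt' step: add isotropic Gaussian noise as a regularizer."""
    return embeddings + rng.normal(0.0, noise_std, size=embeddings.shape)

rng = np.random.default_rng(0)
n, dim = 100, 8

# Stand-in for the 'Connect' step. A real pipeline would call a pre-trained
# contrastive encoder; here each modality is modeled as shared content plus
# a constant modality-specific offset (an assumed toy model of the gap).
shared = rng.normal(size=(n, dim))
text_emb = shared + 3.0 * rng.normal(size=dim)
image_emb = shared + 3.0 * rng.normal(size=dim)

# Modality gap measured as the distance between the two modality means.
gap_before = np.linalg.norm(text_emb.mean(axis=0) - image_emb.mean(axis=0))

# Collapse: center each modality on its own mean.
text_centered = collapse(text_emb, text_emb.mean(axis=0))
image_centered = collapse(image_emb, image_emb.mean(axis=0))
gap_after = np.linalg.norm(
    text_centered.mean(axis=0) - image_centered.mean(axis=0)
)

# Corrupt: perturb the uni-modal training embeddings. The noise scale 0.1
# is an arbitrary illustrative value.
noisy_text = corrupt(text_centered, noise_std=0.1, rng=rng)

print(f"gap before collapse: {gap_before:.3f}")
print(f"gap after collapse:  {gap_after:.3e}")
```

In this toy setting the gap between modality means vanishes after centering by construction. Real contrastive embeddings need not follow the constant-offset model assumed here, so the sketch illustrates the idea rather than reproducing the paper's results.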

Code & Models

Repositories

yuhui-zh15/c3 (PyTorch, official)

Videos

Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques