Extending Multi-modal Contrastive Representations
Zehan Wang, Ziang Zhang, Luping Liu, Yang Zhao, Haifeng Huang, Tao Jin, Zhou Zhao

TL;DR
This paper introduces Ex-MCR, a training-efficient, paired-data-free method that extends multi-modal contrastive representations to more than three modalities by aligning existing MCR spaces, and reports state-of-the-art results on multiple retrieval tasks.
Contribution
Ex-MCR is the first approach to extend multi-modal contrastive representations without paired data, integrating multiple existing MCRs into a unified space with improved performance.
Findings
Achieves state-of-the-art results on multiple retrieval tasks.
Learns a 3D-image-text-audio unified contrastive space without paired data.
Demonstrates emergent semantic alignment between extended modalities.
Abstract
Multi-modal contrastive representation (MCR) of more than three modalities is critical in multi-modal learning. Although recent methods showcase impressive achievements, their heavy dependence on large-scale, high-quality paired data and their expensive training costs limit further development. Inspired by the recent C-MCR, this paper proposes Extending Multimodal Contrastive Representation (Ex-MCR), a training-efficient and paired-data-free method to flexibly learn a unified contrastive representation space for more than three modalities by integrating the knowledge of existing MCR spaces. Specifically, Ex-MCR aligns multiple existing MCRs into the same base MCR, which can effectively preserve the original semantic alignment of the base MCR. Besides, we comprehensively enhance the entire learning pipeline for aligning MCR spaces from the perspectives of training data, architecture, and…
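To make the idea of aligning one MCR space into a base MCR more concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the linear projector `W`, the symmetric InfoNCE loss, the temperature value, and the helper names (`project`, `info_nce`, `retrieve`) are all assumptions chosen for illustration, and the abstract does not specify the actual architecture or objective. The sketch also assumes that row i of both batches describes the same input, for example through a modality that both spaces share.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def project(x, W):
    """Hypothetical linear projector mapping embeddings of one space into the base space."""
    return l2_normalize(x @ W)

def info_nce(projected, base, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `projected` is assumed to match row i of `base`."""
    logits = l2_normalize(projected) @ l2_normalize(base).T / temperature

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))           # targets are the diagonal

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

def retrieve(query, gallery):
    """Return, for each query row, the index of the most similar gallery row."""
    return np.argmax(l2_normalize(query) @ l2_normalize(gallery).T, axis=1)

# Toy data: 8 embeddings of dimension 16 standing in for the base space.
rng = np.random.default_rng(0)
base_emb = rng.normal(size=(8, 16))
W = np.eye(16)                                        # identity projector for the demo

loss_aligned = info_nce(project(base_emb, W), base_emb)
loss_shuffled = info_nce(project(base_emb[::-1], W), base_emb)
```

In this toy setup the correctly ordered batch yields a lower loss than the reversed one, which is the signal a trainable projector would follow. Once embeddings from different spaces sit in the base space, cross-modal retrieval reduces to a nearest-neighbour lookup such as `retrieve`.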
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: ALIGN · Contrastive Language-Image Pre-training
