Data-Efficient Multimodal Fusion on a Single GPU

Noël Vouitsis; Zhaoyan Liu; Satya Krishna Gorti; Valentin Villecroze; Jesse C. Cresswell; Guangwei Yu; Gabriel Loaiza-Ganem; Maksims Volkovs

arXiv:2312.10144·cs.LG·April 11, 2024·1 cites


Open Access · 2 Repos

TL;DR

FuseMix enables efficient multimodal fusion by leveraging pre-trained unimodal encoders, achieving competitive results with significantly less compute and data, and can adapt generative models across modalities.

Contribution

Proposes FuseMix, a novel multimodal augmentation method that operates on latent spaces of pre-trained unimodal encoders, reducing training costs while maintaining high performance.
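The idea of augmenting in latent space can be illustrated with a minimal sketch. It assumes that the augmentation is a mixup-style interpolation between pre-computed latents, with the same mixing coefficient and the same partner sample used in both modalities so that mixed pairs stay semantically matched, and that alignment is trained with a symmetric contrastive (InfoNCE) loss. The function names `fusemix_latents` and `info_nce`, the Beta parameter `alpha`, and the temperature value are illustrative choices, not taken from this page. In practice the mixed latents would presumably pass through small trainable projection heads before the loss; those are omitted here.

```python
import numpy as np


def fusemix_latents(z_a, z_b, alpha=1.0, rng=None):
    """Mixup-style augmentation on paired latents from two frozen encoders.

    z_a: (n, d_a) latents from modality A (e.g. images).
    z_b: (n, d_b) latents from modality B (e.g. text); row i pairs with z_a[i].
    Assumption: one coefficient `lam` and one permutation are shared across
    both modalities so that each mixed pair remains a valid positive pair.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)          # mixing coefficient in [0, 1]
    perm = rng.permutation(z_a.shape[0])  # partner sample for each row
    mixed_a = lam * z_a + (1.0 - lam) * z_a[perm]
    mixed_b = lam * z_b + (1.0 - lam) * z_b[perm]
    return mixed_a, mixed_b, lam, perm


def info_nce(u, v, temperature=0.07):
    """Symmetric contrastive loss; row i of u is the positive for row i of v."""
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    logits = (u @ v.T) / temperature

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for latents that would be pre-computed once by frozen encoders.
    z_img = rng.normal(size=(8, 16))
    z_txt = rng.normal(size=(8, 12))
    mixed_img, mixed_txt, lam, _ = fusemix_latents(z_img, z_txt, rng=rng)
    print(mixed_img.shape, mixed_txt.shape, 0.0 <= lam <= 1.0)
```

Because the encoders stay frozen under this assumption, their latents can be computed once and cached, so each training step touches only small arrays like the ones above. That is one plausible reading of how the reported compute savings arise.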

Findings

01

Achieves competitive performance, and in certain cases outperforms state-of-the-art methods, in image-text and audio-text retrieval.

02

Requires approximately 600 times fewer GPU days and 80 times less data than CLIP.

03

Can convert pre-trained text-to-image models into audio-to-image models.

Abstract

The goal of multimodal alignment is to learn a single latent space that is shared between multimodal inputs. The most powerful models in this space have been trained using massive datasets of paired inputs and large-scale computational resources, making them prohibitively expensive to train in many practical scenarios. We surmise that existing unimodal encoders pre-trained on large amounts of unimodal data should provide an effective bootstrap to create multimodal models from unimodal ones at much lower costs. We therefore propose FuseMix, a multimodal augmentation scheme that operates on the latent spaces of arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal alignment, we achieve competitive performance -- and in certain cases outperform state-of-the-art methods -- in both image-text and audio-text retrieval, with orders of magnitude less compute and data: for…


Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Language-Image Pre-training