Contrastive Vision-Language Pre-training with Limited Resources

Quan Cui; Boyan Zhou; Yu Guo; Weidong Yin; Hao Wu; Osamu Yoshie; Yubo Chen

arXiv:2112.09331·cs.CV·July 19, 2022

TL;DR

This paper introduces resource-efficient contrastive vision-language pre-training methods that achieve competitive results using significantly less data and computational power, making multi-modal learning more accessible.

Contribution

The authors propose a stack of methods that make dual-encoder pre-training effective with limited resources, and provide a reproducible baseline called ZeroVL that uses only 14M image-text pairs drawn from publicly accessible academic datasets and 8 V100 GPUs.

Findings

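01. ZeroVL reaches competitive results with only 14M publicly accessible academic data and 8 V100 GPUs.

02. The proposed methods significantly reduce data and computational requirements relative to works such as CLIP and ALIGN, which rely on billion-level web data and hundreds of GPUs.

03. Pre-training on 100M collected web data further improves performance, giving results comparable or superior to state-of-the-art methods.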

Abstract

Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) have revealed the potential of aligning multi-modal representations with contrastive learning. However, these works require a tremendous amount of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevents researchers with limited resources from reproducing them and exploring further. To this end, we propose a stack of novel methods that significantly cut down the heavy resource dependency and allow us to conduct dual-encoder multi-modal representation alignment with limited resources. We also provide a reproducible baseline with competitive results, namely ZeroVL, using only 14M publicly accessible academic datasets and 8 V100 GPUs. Additionally, we collect 100M web data for pre-training and achieve results comparable or superior to those of state-of-the-art methods, further proving…
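As a rough illustration of the contrastive alignment objective that dual-encoder works such as CLIP are built around, the sketch below computes a symmetric image-to-text and text-to-image cross-entropy over a batch of paired embeddings. It is a generic, minimal sketch written in plain Python for readability, not the ZeroVL implementation: the function names, the toy embeddings, and the temperature of 0.07 are assumptions made for this example, and none of the resource-saving methods proposed in the paper are shown.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image losses for a batch of pairs.

    image_embs[i] and text_embs[i] are assumed to describe the same sample;
    every other pairing in the batch is treated as a negative.
    The temperature value is an assumption for illustration.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = diagonal_cross_entropy(logits)
    text_to_image = diagonal_cross_entropy(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


# Toy batch: two pairs whose embeddings agree, then the same batch with texts swapped.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each image embedding is closest to its own text embedding and large when the pairs are mismatched. Because every other item in the batch serves as a negative, the quality of this objective depends on batch size, which is one reason such training is typically resource-hungry.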

Code & Models

Repositories

zerovl/zerovl (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training