Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

Chao Jia; Yinfei Yang; Ye Xia; Yi-Ting Chen; Zarana Parekh; Hieu Pham; Quoc V. Le; Yunhsuan Sung; Zhen Li; Tom Duerig

arXiv:2102.05918 · cs.CV · June 14, 2021 · 1.2k cites

TL;DR

This paper shows that training on a large, noisy dataset of over one billion image alt-text pairs can produce state-of-the-art visual and vision-language representations without expensive data curation, supporting effective zero-shot classification and retrieval.

Contribution

The paper introduces a scalable approach that trains a simple dual encoder with a contrastive objective on noisy image-text data to learn competitive vision-language representations.
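
The summary does not spell out the training objective, so what follows is only a minimal sketch of a dual-encoder contrastive loss under stated assumptions. The embeddings stand in for the outputs of hypothetical image and text encoders, the loss uses the common symmetric softmax form (image-to-text plus text-to-image), and the temperature is a fixed illustrative value, not one taken from the paper. All function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.05):
    # Row i of image_emb and row i of text_emb are assumed to be a matched pair;
    # every other row in the batch acts as a negative.
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # shape (batch, batch)
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image should rank its own text highest (softmax over rows).
    i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each text should rank its own image highest (softmax over columns).
    t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return i2t + t2i


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    print("matched pairs:   ", contrastive_loss(emb, emb))
    print("mismatched pairs:", contrastive_loss(emb, np.roll(emb, 1, axis=0)))
```

Because negatives come from the other items in the batch, this kind of loss needs no class labels, which is what makes it a plausible fit for noisy alt-text pairs.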

Findings

01 Achieves state-of-the-art results on image classification and retrieval benchmarks.

02 Enables effective zero-shot image classification with simple models.

03 Shows that large-scale noisy data can compensate for lack of precise curation.
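
Zero-shot classification with a dual encoder is often done by embedding each class name as text and choosing the class whose text embedding is closest to the image embedding. The sketch below assumes that procedure; the summary here does not describe the paper's exact protocol. The toy vectors stand in for the outputs of hypothetical encoders, and `zero_shot_classify` is an illustrative name.

```python
import numpy as np


def zero_shot_classify(image_emb, class_text_emb):
    # Normalize both sides so the dot product is cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    txt = class_text_emb / np.linalg.norm(class_text_emb, axis=-1, keepdims=True)
    # For each image, return the index of the most similar class text.
    return np.argmax(img @ txt.T, axis=-1)


if __name__ == "__main__":
    # Toy embeddings: three classes along the three axes of a 3-d space.
    class_text_emb = np.eye(3)
    image_emb = np.array([[0.9, 0.1, 0.0],
                          [0.0, 0.2, 0.8]])
    print(zero_shot_classify(image_emb, class_text_emb))
```

No classifier head is trained in this setup; adding a class only requires embedding its name.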

Abstract

Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A…

Code & Models

Videos

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning