Learning Transferable Visual Models From Natural Language Supervision

Alec Radford; Jong Wook Kim; Chris Hallacy; Aditya Ramesh; Gabriel Goh; Sandhini Agarwal; Girish Sastry; Amanda Askell; Pamela Mishkin; Jack Clark; Gretchen Krueger; Ilya Sutskever

arXiv:2103.00020 · cs.CV · March 2, 2021 · 5.3k cites

TL;DR

This paper introduces a method for training visual models using natural language supervision from large-scale image-caption pairs, enabling zero-shot transfer to various vision tasks without task-specific training.

Contribution

The authors demonstrate that simple image-caption matching pre-training on 400 million pairs yields state-of-the-art representations capable of zero-shot transfer across diverse tasks.
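The image-caption matching objective described above can be illustrated with a small sketch. It assumes a symmetric cross-entropy over cosine similarities within a batch, where the i-th image is paired with the i-th caption; the function names, the toy embeddings, and the fixed temperature of 0.07 are illustrative assumptions, not details taken from this page.

```python
import math

def l2_normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def matching_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-caption similarity matrix.

    Assumes image_embs[i] and text_embs[i] come from the same pair.
    The temperature value is an assumption chosen for illustration.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and caption j.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-caption direction: each row should pick its own caption.
    loss_images = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Caption-to-image direction: each column should pick its own image.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_texts = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (loss_images + loss_texts) / 2

# Toy embeddings standing in for encoder outputs (hypothetical values).
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = matching_loss(aligned, aligned)
loss_shuffled = matching_loss(aligned, aligned[::-1])
```

In this toy setup the loss is near zero when each image lines up with its own caption and large when the captions are swapped, which is the signal the pre-training task relies on.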

Findings

01. Achieves competitive zero-shot performance on over 30 datasets.

02. Matches the accuracy of the original ResNet-50 on ImageNet zero-shot, without using any of the 1.28 million labeled training examples.

03. Enables flexible referencing of visual concepts through natural language.

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different…
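As a rough sketch of the zero-shot transfer step the abstract describes, the snippet below picks the class whose text description is most similar to an image embedding. It assumes both encoders have already produced embeddings in a shared space; the helper names, the prompt wording, and the embedding values are hypothetical and stand in for real encoder outputs.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, class_text_embs):
    """Return the class description most similar to the image embedding.

    class_text_embs maps a natural-language description of each class
    to its text embedding; no task-specific training is involved.
    """
    return max(class_text_embs,
               key=lambda name: cosine(image_emb, class_text_embs[name]))

# Hypothetical embeddings; a real system would obtain these from the
# pre-trained text and image encoders.
class_text_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
}
prediction = zero_shot_classify([0.8, 0.2, 0.1], class_text_embs)
```

Adding a new class under this scheme only requires writing a new description and embedding it, which is what lets natural language reference or describe visual concepts without collecting extra labeled data.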

Code & Models

Videos

OpenAI CLIP | Machine Learning Coding Series · youtube

Learning Transferable Visual Models From Natural Language Supervision · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition

Methods: 3 Dimensional Convolutional Neural Network · Contrastive Language-Image Pre-training