Learning Transferable Visual Models From Natural Language Supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever

TL;DR
This paper introduces a method for training visual models using natural language supervision from large-scale image-caption pairs, enabling zero-shot transfer to various vision tasks without task-specific training.
Contribution
The authors demonstrate that simple image-caption matching pre-training on 400 million pairs yields state-of-the-art representations capable of zero-shot transfer across diverse tasks.
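The caption-matching objective can be illustrated with a small sketch. It assumes a symmetric cross-entropy over temperature-scaled cosine similarities, which is one common way to implement this kind of matching and is not taken from this summary. The function names, the toy embeddings, and the `temperature=0.07` default are all illustrative assumptions; in a real model the embeddings would come from trained image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image matching losses.

    Pair i of image_embs and text_embs is assumed to be a true
    (image, caption) pair; every other combination is a negative.
    The temperature value is an assumed, illustrative default.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j] = scaled cosine similarity of image i and caption j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Each image should pick its own caption (rows) ...
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # ... and each caption should pick its own image (columns).
    loss_texts = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_images + loss_texts) / 2


# Toy batch: correctly aligned pairs versus deliberately swapped captions.
aligned = symmetric_contrastive_loss(
    [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
)
swapped = symmetric_contrastive_loss(
    [[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]
)
```

In this toy batch the aligned pairs give a loss close to zero, while the swapped captions give a much larger one, which is the signal that pushes matching images and captions together during pre-training.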
Findings
Achieves competitive zero-shot performance on 30+ datasets.
Matches the accuracy of the original ResNet-50 on ImageNet in the zero-shot setting, without using any of the 1.28 million ImageNet training examples.
Enables flexible referencing of visual concepts through natural language.
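Zero-shot transfer can be sketched as a nearest-prompt lookup: each class name is turned into a text prompt, and the image is assigned to the prompt whose embedding is most similar. The sketch below is a minimal illustration under stated assumptions. The three-dimensional embeddings are made-up stand-ins for encoder outputs, and the `zero_shot_classify` helper, the prompt wording, and the `scale=100.0` factor are hypothetical choices rather than details taken from this summary.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    """Convert raw scores to probabilities, computed stably."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_emb, prompt_embs, scale=100.0):
    """Return a probability for each prompt given one image embedding.

    prompt_embs maps a text prompt to its embedding. The scale factor
    sharpens the softmax and is an assumed, illustrative value.
    """
    prompts = list(prompt_embs)
    scores = [scale * cosine_similarity(image_emb, prompt_embs[p]) for p in prompts]
    return dict(zip(prompts, softmax(scores)))


# Toy embeddings standing in for text-encoder and image-encoder outputs.
prompt_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_emb, prompt_embs)
best_prompt = max(probs, key=probs.get)
```

Because the class set is defined only by the prompts, adding a new category means adding one more entry to `prompt_embs`; no classifier weights are retrained.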
Abstract
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different…
Peer Reviews
No public reviews on file for this paper yet.
Code & Models
- 🤗 stable-diffusion-v1-5/stable-diffusion-v1-5 (model) · 1.7M downloads · ♡ 1066
- 🤗 openai/clip-vit-large-patch14 (model) · 24.8M downloads · ♡ 1982
- 🤗 openai/clip-vit-base-patch32 (model) · 20.3M downloads · ♡ 895
- 🤗 CompVis/stable-diffusion-v1-4 (model) · 468k downloads · ♡ 6991
- 🤗 CompVis/stable-diffusion-v-1-4-original (model) · ♡ 2843
- 🤗 openai/clip-vit-base-patch16 (model) · 1.9M downloads · ♡ 154
- 🤗 sentence-transformers/clip-ViT-B-32 (model) · ♡ 148
- 🤗 jm12138/riffusion-model-v1 (model) · ♡ 3
- 🤗 apple/MobileCLIP-S0 (model) · 67 downloads · ♡ 13
- 🤗 apple/MobileCLIP2-L-14 (model) · 20 downloads · ♡ 3
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: 3 Dimensional Convolutional Neural Network · Contrastive Language-Image Pre-training
