Reproducible scaling laws for contrastive language-image learning
Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, Jenia Jitsev

TL;DR
This paper investigates how contrastive language-image models scale with data and model size using public datasets, revealing key factors affecting performance and providing open-source tools for reproducibility.
Contribution
It presents the first large-scale study of scaling laws for CLIP models trained on public data, highlighting the impact of training distribution and providing open-source resources.
Findings
Power law scaling observed across multiple tasks
Training distribution significantly affects scaling behavior
Open-source models and evaluation workflow provided
Abstract
Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set size, model size, and compute, which offers valuable guidance as large-scale experiments are becoming increasingly expensive. However, previous work on scaling laws has primarily used private data and models or focused on uni-modal language or vision learning. To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository. Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks including zero-shot classification, retrieval, linear probing, and end-to-end fine-tuning. We find that the training distribution…
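As a rough illustration of what "power law scaling" means in practice, the sketch below fits a relation of the form E = a * C^b (downstream error E against a scale measure C such as compute) by linear least squares in log-log space. The functional form is a common choice and is assumed here rather than taken from this summary; the function name `fit_power_law` and all numbers are hypothetical, and the data points are synthetic, not results from the paper.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y).

    Returns (a, b). Assumes all xs and ys are strictly positive.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent b.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx
    # Intercept in log space is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Synthetic points generated from a known power law (illustration only,
# NOT measurements from the paper).
true_a, true_b = 50.0, -0.2
compute = [1e9, 1e10, 1e11, 1e12]
error = [true_a * c ** true_b for c in compute]

a, b = fit_power_law(compute, error)

# Extrapolate the fitted law to a larger, unseen scale.
predicted_error = a * (1e13) ** b
```

On a log-log plot such a fit is a straight line with slope `b`, which is why scaling studies typically report results on logarithmic axes. Real measurements are noisy, so the fit quality and the range over which extrapolation holds would need to be checked case by case.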
Code & Models
- 🤗 laion/CLIP-ViT-bigG-14-laion2B-39B-b160k · model · 70k dl · ♡ 308
- 🤗 timm/vit_base_patch32_clip_224.laion2b_ft_in1k · model · 59 dl · ♡ 1
- 🤗 timm/vit_large_patch14_clip_224.laion2b_ft_in1k · model · 840 dl
- 🤗 timm/vit_huge_patch14_clip_224.laion2b_ft_in1k · model · 460 dl
- 🤗 timm/vit_large_patch14_clip_224.laion2b_ft_in12k_in1k · model · 1.2k dl
- 🤗 timm/vit_huge_patch14_clip_224.laion2b_ft_in12k_in1k · model · 1.3k dl · ♡ 2
- 🤗 timm/vit_large_patch14_clip_224.laion2b_ft_in12k · model · 73 dl
- 🤗 timm/vit_huge_patch14_clip_224.laion2b_ft_in12k · model · 59 dl · ♡ 1
- 🤗 timm/vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k · model · 433 dl · ♡ 2
- 🤗 timm/vit_large_patch14_clip_336.laion2b_ft_in12k_in1k · model · 1.7k dl · ♡ 1
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: Contrastive Language-Image Pre-training
