Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies
Zichao Li, Cihang Xie, Ekin Dogus Cubuk

TL;DR
This paper analyzes how data quality, model architecture, and training strategies impact the performance of scaled-down CLIP models, providing practical guidance for efficient training under limited compute resources.
Contribution
It offers a comprehensive analysis of CLIP scaling, highlighting the importance of high-quality data, suitable architecture choices, and effective training strategies for limited compute budgets.
Findings
A smaller high-quality dataset can outperform a larger low-quality one.
Under a fixed compute budget, smaller ViT models suit smaller datasets, while larger models perform better on larger datasets.
CLIP+Data Augmentation achieves performance comparable to CLIP with half the training data.
Abstract
This paper investigates the performance of Contrastive Language-Image Pre-training (CLIP) when scaled down to limited computation budgets. We explore CLIP along three dimensions: data, architecture, and training strategies. With regard to data, we demonstrate the significance of high-quality training data and show that a smaller dataset of high-quality data can outperform a larger dataset with lower quality. We also examine how model performance varies with different dataset sizes, suggesting that smaller ViT models are better suited for smaller datasets, while larger models perform better on larger datasets with fixed compute. Additionally, we provide guidance on when to choose a CNN-based architecture or a ViT-based architecture for CLIP training. We compare four CLIP training strategies - SLIP, FLIP, CLIP, and CLIP+Data Augmentation - and show that the choice of training…
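For readers unfamiliar with the objective that the compared strategies build on, the following is a minimal sketch of the symmetric contrastive loss used in CLIP-style training. It is an illustration, not the authors' code: the function names, the NumPy implementation, and the fixed temperature of 0.07 (CLIP learns this value during training) are assumptions made here for clarity.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over image-text similarity logits.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each
    is assumed to come from the same image-caption pair.
    temperature: fixed here for illustration (an assumption).
    """
    # L2-normalize so the dot product is cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # logits[i, j] = similarity of image i and text j, scaled by temperature.
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])

    # Matching pairs lie on the diagonal; classify in both directions.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))
    print("aligned pairs:   ", clip_contrastive_loss(emb, emb))
    print("misaligned pairs:", clip_contrastive_loss(emb, np.roll(emb, 1, axis=0)))
```

In this sketch, correctly paired embeddings yield a lower loss than shuffled pairs, which is the signal the image and text encoders are trained on.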
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
Methods: FLIP · Contrastive Language-Image Pre-training
