Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies

Zichao Li; Cihang Xie; Ekin Dogus Cubuk

arXiv:2404.08197·cs.CV·April 17, 2024·1 cites

Open Access

TL;DR

This paper analyzes how data quality, model architecture, and training strategies impact the performance of scaled-down CLIP models, providing practical guidance for efficient training under limited compute resources.

Contribution

It offers a comprehensive analysis of CLIP scaling, highlighting the importance of high-quality data, suitable architecture choices, and effective training strategies for limited compute budgets.

Findings

01
A smaller dataset of high-quality data can outperform a larger dataset of lower quality.

02
Under a fixed compute budget, smaller ViT models are better suited to smaller datasets, while larger models perform better on larger datasets.

03
CLIP+Data Augmentation achieves performance comparable to CLIP using only half of the training data.

Abstract

This paper investigates the performance of Contrastive Language-Image Pre-training (CLIP) when scaled down to limited computation budgets. We explore CLIP along three dimensions: data, architecture, and training strategies. With regard to data, we demonstrate the significance of high-quality training data and show that a smaller dataset of high-quality data can outperform a larger dataset of lower quality. We also examine how model performance varies with dataset size, suggesting that, under fixed compute, smaller ViT models are better suited to smaller datasets, while larger models perform better on larger datasets. Additionally, we provide guidance on when to choose a CNN-based architecture or a ViT-based architecture for CLIP training. We compare four CLIP training strategies - SLIP, FLIP, CLIP, and CLIP+Data Augmentation - and show that the choice of training…
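
For readers unfamiliar with the objective shared by the training strategies named in the abstract, the following is a minimal sketch of a CLIP-style symmetric contrastive loss. It is not taken from the paper: the function names (`l2_normalize`, `log_softmax`, `clip_loss`), the toy two-dimensional embeddings, and the fixed temperature of 0.07 are assumptions chosen for illustration, and real implementations operate on batched tensors with a learnable temperature.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb[i] and text_emb[i] are assumed to describe the same example.
    The temperature value is an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j]: temperature-scaled cosine similarity of image i and text j
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: row i should assign most mass to column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should assign most mass to row j.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2


# Toy batch: correctly paired embeddings versus deliberately swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

matched_loss = clip_loss(images, matched_texts)
swapped_loss = clip_loss(images, swapped_texts)
```

In this toy batch the correctly paired embeddings give a loss near zero, while the swapped pairs give a much larger one, which is the signal the contrastive objective uses to align the two encoders.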

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques

Methods: FLIP · Contrastive Language-Image Pre-training