Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)
Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, Ludwig Schmidt

TL;DR
This paper investigates the causes of robustness in contrastively trained language-image models like CLIP, finding that training data diversity is the primary factor behind their improved performance across distribution shifts.
Contribution
The study systematically analyzes various factors influencing robustness in CLIP-like models and introduces ImageNet-Captions for controlled language-image training experiments.
Findings
Diverse training data is the main driver of robustness.
The other factors studied (training set size, language supervision at training and test time, and the contrastive loss function) contribute little to no robustness.
Introduction of ImageNet-Captions enables further research on language-image training effects.
Abstract
Contrastively trained language-image models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these language-image models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause for the robustness gains, with the other factors contributing little to no robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a…
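Factor (v) in the abstract, the contrastive loss function, can be illustrated with a minimal sketch. This is not the authors' code: it assumes the commonly described symmetric cross-entropy formulation over cosine similarities, and the function names and the fixed temperature of 0.07 are hypothetical choices made for illustration only.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over a list of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb[i] and text_emb[i] are assumed to be a matching pair.
    # temperature is an assumed fixed value for this sketch.
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: row i should assign high probability to column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should assign high probability to row j.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    # Average the two directions.
    return (loss_i2t + loss_t2i) / 2

# Toy batch of two pairs: correctly aligned vs. swapped captions.
matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
uniform = contrastive_loss([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
```

In this toy batch the aligned pairs give a loss near zero, the swapped pairs give a large loss, and indistinguishable embeddings give the chance-level value log(2) for a batch of two.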
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
Methods: ALIGN · Contrastive Language-Image Pre-training
