Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)

Alex Fang; Gabriel Ilharco; Mitchell Wortsman; Yuhao Wan; Vaishaal Shankar; Achal Dave; Ludwig Schmidt

arXiv:2205.01397·cs.CV·August 24, 2022·27 cites

TL;DR

This paper investigates the causes of robustness in contrastively trained language-image models like CLIP, finding that training data diversity is the primary factor behind their improved performance across distribution shifts.

Contribution

The study systematically analyzes various factors influencing robustness in CLIP-like models and introduces ImageNet-Captions for controlled language-image training experiments.

Findings

1. Diverse training data is the main driver of robustness.

2. Other factors, such as training set size and language supervision, contribute little to no robustness.

3. The introduction of ImageNet-Captions enables further research on the effects of language-image training.

Abstract

Contrastively trained language-image models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these language-image models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause for the robustness gains, with the other factors contributing little to no robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a…
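
Factor (v) in the abstract, the contrastive loss function, is not spelled out on this page. As a rough illustration only, the following pure-Python sketch shows the symmetric image-text contrastive objective commonly associated with CLIP-style training: matched image-text pairs sit on the diagonal of a similarity matrix, and a cross-entropy term is averaged over both directions. The function names, the temperature value of 0.07, and the toy random embeddings are assumptions made for this example and are not taken from the paper.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy (assumed form)."""
    images = [l2_normalize(v) for v in image_emb]
    texts = [l2_normalize(v) for v in text_emb]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(transposed)
    )


# Toy data: hypothetical 8-dimensional embeddings for a batch of 4 pairs.
random.seed(0)
images = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]

# Perfectly matched pairs versus deliberately mismatched pairs.
aligned_loss = symmetric_contrastive_loss(images, images)
shuffled_loss = symmetric_contrastive_loss(images, images[::-1])
print(aligned_loss, shuffled_loss)
```

In this toy setup the text embeddings are simply copies of the image embeddings, so the aligned batch scores a lower loss than the reversed one. A real model would produce the two sets of embeddings from separate image and text encoders.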

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: ALIGN · Contrastive Language-Image Pre-training