Demystifying CLIP Data

Hu Xu; Saining Xie; Xiaoqing Ellen Tan; Po-Yao Huang; Russell Howes; Vasu Sharma; Shang-Wen Li; Gargi Ghosh; Luke Zettlemoyer; Christoph Feichtenhofer

arXiv:2309.16671 · cs.CV · November 25, 2025 · 22 cites

TL;DR

This paper reveals the data curation process behind CLIP, introduces MetaCLIP for balanced data selection, and demonstrates its superior performance on benchmarks using curated datasets.

Contribution

The work uncovers CLIP's data collection approach, proposes MetaCLIP for metadata-driven data curation, and shows improved benchmark results with curated data.

Findings

1. MetaCLIP outperforms CLIP on multiple benchmarks.
2. Curated data improves zero-shot ImageNet accuracy.
3. Scaling data with MetaCLIP enhances performance without increasing the training budget.

Abstract

Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings,…
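The balancing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' released code. The substring matching, the per-entry threshold `cap`, and the sampling rule are assumptions made for illustration, and the names `match_entries` and `balance` are hypothetical.

```python
import random
from collections import Counter


def match_entries(text, metadata):
    """Return the metadata entries that occur as substrings of the text."""
    lowered = text.lower()
    return [entry for entry in metadata if entry in lowered]


def balance(pairs, metadata, cap, seed=0):
    """Sub-sample (image, text) pairs so that no metadata entry dominates.

    `cap` is a hypothetical per-entry threshold: entries matched by at most
    `cap` texts keep all of their pairs, more frequent entries are
    down-sampled to roughly `cap` pairs in expectation.
    """
    rng = random.Random(seed)
    matches = [match_entries(text, metadata) for _, text in pairs]
    counts = Counter(entry for matched in matches for entry in matched)
    # Probability of keeping a pair through one of its matched entries.
    keep_prob = {entry: min(1.0, cap / n) for entry, n in counts.items()}
    kept = []
    for pair, matched in zip(pairs, matches):
        # Pairs that match no entry are dropped; a matched pair survives
        # if at least one of its entries selects it.
        if any(rng.random() < keep_prob[entry] for entry in matched):
            kept.append(pair)
    return kept
```

In this sketch a rare entry keeps every pair it matches, while a very common entry is thinned out, which flattens the distribution over metadata entries. Text that matches no entry is discarded, so the metadata list also acts as a coarse quality filter.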

Peer Reviews

Decision · ICLR 2024 spotlight

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

- The paper tackles an important problem in the community regarding the opacity of data curation processes of foundation models. Moreover, it promises to open-source part of the efforts, including data curation code and training data distribution.
- The paper is also strong in terms of empirical evaluation, including various data sizes, and different implementations of the balancing steps. Huge resources are devoted to the evaluation part.

Weaknesses

- The main weakness lies in the technical novelty. The paper is a commendable effort to reproduce the data curation pipeline that has already been described in the original CLIP paper (Radford et al., 2021), and to report the findings from reproducing an existing data curation technique. The paper's novelty could be greatly enhanced by exploring some new technical components beyond what is already described in Radford et al., 2021.

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 4

Strengths

1. The effort to reproduce the exact data construction procedure of the original CLIP paper is well motivated and appreciated. As the authors pointed out, later datasets like LAION or DataComp all adopt a trained CLIP model during data collection. How to build a high-quality and diverse image-text dataset from scratch, like WIT400M, is still a mystery to the community.
2. Given that CLIP is such an important foundation model that connects image and text, the data crafting pipeline and the resu

Weaknesses

1. I find that the authors use the average accuracy across multiple datasets as a major performance metric throughout the paper. This is exemplified by Tables 4 and 5 in the main text, and also by some tables in the appendix. This does not make sense to me. Those datasets come with different numbers of classes and samples. For instance, averaging the accuracy of a dataset of 10 classes (e.g. EuroSAT) and a dataset of 102 classes (e.g. Flowers) is unreasonable, because misclassifying all sampl

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

1. Almost three years after the CLIP paper came out, it is great to see efforts following and investigating the data curation pipeline in the CLIP paper, whose proposed data diversification (balancing) was ignored by other works, such as LAION. The paper would also be interesting to researchers working on data curation and contrastive pre-training.
2. The experimental results are impressive and set the new state of the art.

Weaknesses

1. The contribution of the paper is limited. The main contribution is the entry balancing used by CLIP. This balancing operation actually plays a similar role to the deduplication used in [1] and [2], where its effectiveness has been proven.
2. In Sec 3.4, when sub-sampling image-text pairs for each entry, in addition to the information density based rule, it is worth trying some model-based rules, e.g., image-text matching based rules. Although the paper mainly aims to reproduce the CLIP paper'

Code & Models

Repositories

Models

Datasets

Mitsua/safe-commons-pd-3m
dataset · 125 downloads

Videos

Demystifying CLIP Data · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Radiomics and Machine Learning in Medical Imaging

Methods: Contrastive Language-Image Pre-training