Demystifying CLIP Data
Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer

TL;DR
This paper reveals the data curation process behind CLIP, introduces MetaCLIP for balanced data selection, and demonstrates its superior performance on benchmarks using curated datasets.
Contribution
The work uncovers CLIP's data collection approach, proposes MetaCLIP for metadata-driven data curation, and shows improved benchmark results with curated data.
Findings
MetaCLIP outperforms CLIP on multiple benchmarks.
Curated data improves zero-shot ImageNet accuracy.
Scaling data with MetaCLIP enhances performance without increasing training budget.
Abstract
Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings,…
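To make the balancing idea in the abstract concrete, here is a minimal Python sketch. It assumes that captions are matched against metadata entries by sub-string matching and that a per-entry cap (`cap`) down-samples frequent entries while keeping rare ones. The function names, the cap value, and the exact sampling rule are illustrative assumptions, not the authors' released code.

```python
import random
from collections import Counter


def match_entries(text, metadata):
    """Return the metadata entries that occur as sub-strings of the text."""
    lowered = text.lower()
    return [entry for entry in metadata if entry in lowered]


def balance(texts, metadata, cap, seed=0):
    """Sub-sample texts so that no single metadata entry dominates.

    `cap` is a hypothetical per-entry limit: entries matched at most `cap`
    times keep all their texts, more frequent entries are down-sampled.
    """
    rng = random.Random(seed)
    matches = [match_entries(t, metadata) for t in texts]
    counts = Counter(entry for matched in matches for entry in matched)
    # Keep probability is 1.0 for rare entries and cap / count for frequent ones.
    keep_prob = {entry: cap / max(count, cap) for entry, count in counts.items()}
    kept = []
    for text, matched in zip(texts, matches):
        # A text survives if any matched entry selects it;
        # texts that match no entry are dropped.
        if any(rng.random() < keep_prob[entry] for entry in matched):
            kept.append(text)
    return kept
```

With a pool of many "dog" captions and a handful of "quokka" captions, a small `cap` would keep every quokka caption and only a fraction of the dog captions, which flattens the distribution over metadata entries.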
Peer Reviews
Decision: ICLR 2024 spotlight
- The paper tackles an important problem in the community regarding the opacity of data curation processes of foundation models. Moreover, it promises to open-source part of the efforts, including data curation code and training data distribution.
- The paper is also strong in terms of empirical evaluation, including various data sizes and different implementations of the balancing steps. Huge resources are devoted to the evaluation part.
- The main weakness lies in the technical novelty. The paper is a commendable effort to reproduce the data curation pipeline already described in the original CLIP paper (Radford et al., 2021) and to report the findings from reproducing an existing data curation technique. The paper's novelty could be greatly enhanced by exploring new technical components beyond what is already described in Radford et al., 2021.
1. The effort to reproduce the exact data construction procedure of the original CLIP paper is well motivated and appreciated. As the authors pointed out, later datasets like LAION or DataComp all adopt a trained CLIP model during data collection. How to build a high-quality and diverse image-text dataset from scratch, like WIT400M, is still a mystery to the community.
2. Given that CLIP is such an important foundation model that connects image and text, the data crafting pipeline and the resu
1. I find that the authors use the average accuracy across multiple datasets as a major performance metric throughout the paper. This is exemplified by Tables 4 and 5 in the main text, and also by some tables in the appendix. This does not make sense to me. Those datasets come with different numbers of classes and samples. For instance, averaging the accuracy of a dataset of 10 classes (e.g. EuroSAT) and a dataset of 102 classes (e.g. Flowers) is unreasonable, because misclassifying all sampl
1. Almost three years after the CLIP paper came out, it is great to see efforts following and investigating the data curation pipeline in the CLIP paper, whose proposed data diversification (balancing) was ignored by other works, such as LAION. The paper would also be interesting to researchers working on data curation and contrastive pre-training.
2. The experimental results are impressive and set a new state of the art.
1. The contribution of the paper is limited. The main contribution is the entry balancing used by CLIP. This balancing operation actually plays a similar role to the deduplication used in [1] and [2], where its effectiveness has been proven.
2. In Sec 3.4, when sub-sampling image-text pairs for each entry, in addition to the information density based rule, it is worth trying some model-based rules, e.g., image-text matching based rules. Although the paper mainly aims to reproduce the CLIP paper'
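The averaging concern raised in the reviews above can be illustrated with a small Python sketch. It contrasts an unweighted mean over datasets with a sample-weighted mean. The dataset names, accuracies, and sample counts below are made up for illustration and are not results from the paper.

```python
def macro_average(accuracies):
    """Unweighted mean: every dataset counts equally, whatever its size."""
    return sum(accuracies.values()) / len(accuracies)


def sample_weighted_average(accuracies, num_samples):
    """Mean weighted by the number of test samples in each dataset."""
    total = sum(num_samples.values())
    return sum(accuracies[name] * num_samples[name] for name in accuracies) / total


# Hypothetical numbers, chosen only to show how the two averages diverge.
accuracies = {"dataset_a": 0.50, "dataset_b": 0.90}
num_samples = {"dataset_a": 1000, "dataset_b": 9000}

print(macro_average(accuracies))
print(sample_weighted_average(accuracies, num_samples))
```

For these made-up inputs the unweighted mean is about 0.70 while the sample-weighted mean is about 0.86, so the choice of averaging scheme alone can shift a headline number.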
Code & Models
- 🤗 facebook/metaclip-b32-400m · model · 139k dl · ♡ 46
- 🤗 facebook/metaclip-b32-fullcc2.5b · model · 185 dl · ♡ 9
- 🤗 facebook/metaclip-h14-fullcc2.5b · model · 7.8k dl · ♡ 49
- 🤗 facebook/metaclip-b16-fullcc2.5b · model · 1.8k dl · ♡ 11
- 🤗 facebook/metaclip-b16-400m · model · 29 dl · ♡ 5
- 🤗 facebook/metaclip-l14-fullcc2.5b · model · 1.1k dl · ♡ 7
- 🤗 facebook/metaclip-l14-400m · model · 121 dl · ♡ 7
- 🤗 ericlewis/metaclip-h14-fullcc2.5b · model · 1 dl
- 🤗 cs-giung/clip-vit-base-patch32-fullcc2.5b · model · 4 dl
- 🤗 cs-giung/clip-vit-base-patch16-fullcc2.5b · model
Taxonomy
Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Radiomics and Machine Learning in Medical Imaging
Methods: Contrastive Language-Image Pre-training
