DataComp: In search of the next generation of multimodal datasets
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe

TL;DR
DataComp is a benchmark for dataset design built around a candidate pool of 12.8 billion image-text pairs from Common Crawl. It enables systematic evaluation of data curation strategies for training models such as CLIP, and its baselines show that better-curated training sets lead to better downstream performance.
Contribution
It provides a standardized benchmark and workflow for dataset curation and evaluation, fostering research on dataset design for multimodal learning.
Findings
The best baseline, DataComp-1B, trains a CLIP ViT-L/14 from scratch to 79.2% zero-shot ImageNet accuracy
That model outperforms OpenAI's CLIP ViT-L/14 by 3.7 percentage points
Benchmark spans multiple compute scales for diverse research accessibility
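The zero-shot accuracy figures above come from classifying images without task-specific fine-tuning. As a rough illustration of how such a metric is computed, the sketch below assigns each image to the class whose text embedding has the highest cosine similarity with the image embedding. The 2-D embeddings, labels, and function names are hand-made assumptions for illustration only; they are not outputs of any DataComp model and this is not the benchmark's evaluation code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def zero_shot_predict(image_emb, class_embs):
    """Return the index of the class embedding most similar to the image."""
    return max(range(len(class_embs)),
               key=lambda i: cosine(image_emb, class_embs[i]))

def zero_shot_accuracy(image_embs, labels, class_embs):
    """Fraction of images whose predicted class matches the label."""
    correct = sum(zero_shot_predict(e, class_embs) == y
                  for e, y in zip(image_embs, labels))
    return correct / len(labels)

# Hypothetical toy data: two classes, four images, 2-D embeddings.
CLASS_EMBS = [[1.0, 0.0], [0.0, 1.0]]
IMAGE_EMBS = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]
LABELS = [0, 1, 1, 0]

ACCURACY = zero_shot_accuracy(IMAGE_EMBS, LABELS, CLASS_EMBS)
print(ACCURACY)
```

In a real evaluation the embeddings would come from the trained image and text encoders, with class embeddings built from text prompts for each class name.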
Abstract
Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow…
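The filtering step described in the abstract can be pictured with a minimal sketch: score every image-text pair in a candidate pool and keep the top-scoring fraction as the training set. The abstract does not prescribe a particular filter, so the `score` field, the `filter_pool` helper, the `keep_fraction` parameter, and the example.com URLs below are all hypothetical.

```python
def filter_pool(pool, keep_fraction):
    """Keep the highest-scoring `keep_fraction` of a candidate pool.

    `pool` is a list of dicts with hypothetical keys
    'url', 'caption', and 'score' (a per-pair quality score).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, int(len(pool) * keep_fraction))
    ranked = sorted(pool, key=lambda pair: pair["score"], reverse=True)
    return ranked[:n_keep]

# Hypothetical candidate pool; the scores are made up for illustration.
POOL = [
    {"url": "https://example.com/a.jpg", "caption": "a dog on a beach", "score": 0.31},
    {"url": "https://example.com/b.jpg", "caption": "IMG_0042", "score": 0.05},
    {"url": "https://example.com/c.jpg", "caption": "red bicycle by a wall", "score": 0.28},
    {"url": "https://example.com/d.jpg", "caption": "click here", "score": 0.09},
]

SUBSET = filter_pool(POOL, keep_fraction=0.5)
for pair in SUBSET:
    print(pair["url"], pair["score"])
```

In the benchmark, the resulting subset would then be used with the standardized CLIP training code and evaluated on the downstream test sets; neither step is reproduced here.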
Code & Models
- 🤗 laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K (model · 125k downloads · ♡ 123)
- 🤗 gadgetsam/CLIP-ViT-L-14-DataComp.XL-s13B-b90K (model · 2 downloads)
- 🤗 Aixile/CLIP-ViT-L-14-DataComp.XL-s13B-b90K (model · 2 downloads · ♡ 1)
- 🤗 laion/CLIP-ViT-B-16-DataComp.XL-s13B-b90K (model · 11k downloads · ♡ 8)
- 🤗 timm/vit_large_patch14_clip_336.laion2b_ft_in12k_in1k_inat21 (model · 23 downloads)
- 🤗 timm/vit_large_patch14_clip_336.datacompxl_ft_inat21 (model · 90 downloads · ♡ 1)
- 🤗 timm/eva02_large_patch14_clip_336.merged2b_ft_inat21 (model · 2.8k downloads · ♡ 12)
- 🤗 flavour/CLIP-ViT-B-16-DataComp.XL-s13B-b90K (model · 4.7k downloads · ♡ 1)
- 🤗 timm/vit_large_patch14_clip_336.laion2b_ft_augreg_inat21 (model · 178 downloads)
- 🤗 timm/convnext_large_mlp.laion2b_ft_augreg_inat21 (model · 109 downloads)
Taxonomy
Topics: Topic Modeling · Machine Learning in Healthcare · Generative Adversarial Networks and Image Synthesis
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization
