DataComp: In search of the next generation of multimodal datasets

Samir Yitzhak Gadre; Gabriel Ilharco; Alex Fang; Jonathan Hayase; Georgios Smyrnis; Thao Nguyen; Ryan Marten; Mitchell Wortsman; Dhruba Ghosh; Jieyu Zhang; Eyal Orgad; Rahim Entezari; Giannis Daras; Sarah Pratt; Vivek Ramanujan; Yonatan Bitton; Kalyani Marathe; Stephen Mussmann; Richard Vencu; Mehdi Cherti; Ranjay Krishna; Pang Wei Koh; Olga Saukh; Alexander Ratner; Shuran Song; Hannaneh Hajishirzi; Ali Farhadi; Romain Beaumont; Sewoong Oh; Alex Dimakis; Jenia Jitsev; Yair Carmon; Vaishaal Shankar; Ludwig Schmidt

arXiv:2304.14108 · cs.CV · October 23, 2023 · 74 cites

TL;DR

DataComp is a benchmark for dataset design built around a candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants filter the pool or curate new data, train CLIP with standardized code, and evaluate the resulting model on 38 downstream test sets, which makes dataset quality directly comparable across submissions.

Contribution

DataComp provides a standardized benchmark and workflow for dataset curation and evaluation, fostering research on dataset design for multimodal learning.

Findings

01

The best baseline, a CLIP ViT-L/14 trained from scratch on DataComp-1B, achieves 79.2% zero-shot accuracy on ImageNet

02

A CLIP ViT-L/14 trained on DataComp-1B outperforms OpenAI's CLIP ViT-L/14 by 3.7 percentage points in zero-shot ImageNet accuracy

03

The benchmark spans compute scales covering four orders of magnitude, which keeps it accessible to researchers with varying resources

Abstract

Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow…
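The filtering step described in the abstract can be illustrated with a minimal sketch. It assumes each candidate pair already carries a precomputed quality score (for example, an image-text similarity) and keeps the top-scoring fraction of the pool. The `Pair` type, the `score` field, and `filter_top_fraction` are hypothetical names chosen for illustration; they are not part of the DataComp tooling, and this is not presented as the authors' filtering method.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Pair:
    """One image-text candidate (hypothetical schema, for illustration only)."""
    url: str
    caption: str
    score: float  # assumed precomputed quality score; higher is better


def filter_top_fraction(pool: Sequence[Pair], keep_fraction: float) -> List[Pair]:
    """Return the highest-scoring `keep_fraction` of the candidate pool."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Rank candidates from best to worst score.
    ranked = sorted(pool, key=lambda p: p.score, reverse=True)
    # Keep at least one pair whenever the pool is non-empty.
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


if __name__ == "__main__":
    pool = [
        Pair("img-a", "a dog on a beach", 0.10),
        Pair("img-b", "a red bicycle", 0.40),
        Pair("img-c", "city skyline at night", 0.30),
        Pair("img-d", "click here", 0.20),
    ]
    subset = filter_top_fraction(pool, keep_fraction=0.5)
    print([p.url for p in subset])
```

In the workflow the abstract describes, a subset selected this way would then be passed to the standardized CLIP training code and scored on the downstream test sets.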

Code & Models

Videos

DataComp: In search of the next generation of multimodal datasets · SlidesLive

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Generative Adversarial Networks and Image Synthesis

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Adam · Residual Connection · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization