Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo, Piyush Sharma, Nan Ding, Radu Soricut

TL;DR
This paper introduces Conceptual 12M, a large-scale image-text dataset for vision-and-language pre-training, demonstrating that increased data diversity and scale improve performance on long-tail visual recognition tasks.
Contribution
The paper presents CC12M, a dataset built by relaxing the data collection constraints used for CC3M, which yields larger and more diverse pre-training data for vision-and-language models.
Findings
Pre-training on CC12M improves downstream task performance relative to CC3M
Scaling up pre-training data improves long-tail visual recognition
Achieves new state-of-the-art results on the nocaps and Conceptual Captions benchmarks
Abstract
The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pre-training. However, these datasets are often collected with overrestrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pre-training data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [Sharma et al. 2018] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. Our results…
Code & Models
- 🤗 dalle-mini/dalle-mini (model) · 192 dl · ♡ 396
- 🤗 flax-community/clip-vit-base-patch32_marian-es (model) · 2 dl
- 🤗 flax-community/clip-vit-base-patch32_mbart-large-50 (model) · 5 dl · ♡ 2
- 🤗 caio13/dalle-mono (model)
- 🤗 airsat/dalle-mini (model) · 9 dl · ♡ 1
- 🤗 matrix11404/bopts_news (model) · 1 dl
- 🤗 raw9/dalle-mini (model)
