Improving Multimodal Datasets with Image Captioning
Thao Nguyen, Samir Yitzhak Gadre, Gabriel Ilharco, Sewoong Oh, Ludwig Schmidt

TL;DR
This paper demonstrates that using generated image captions to augment web datasets improves vision-language model performance, surpassing filtering methods, and provides insights into caption quality and dataset curation at scale.
Contribution
The paper mixes generated captions with raw web captions to improve dataset quality and model performance, outperforming the best filtering method proposed by the DataComp benchmark.
Findings
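Mixing raw and generated captions beats the best DataComp filtering method by 2% on ImageNet and by 4% on average across 38 tasks, given a candidate pool of 128M image-text pairs.
The best approach is also 2x better at Flickr and MS-COCO retrieval.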
Caption quality metrics do not reliably predict their utility for training.
Abstract
Massive web datasets play a key role in the success of large vision-language models like CLIP and Flamingo. However, the raw web data is noisy, and existing filtering methods to reduce noise often come at the expense of data diversity. Our work focuses on caption quality as one major source of noise, and studies how generated captions can increase the utility of web-scraped datapoints with nondescript text. Through exploring different mixing strategies for raw and generated captions, we outperform the best filtering method proposed by the DataComp benchmark by 2% on ImageNet and 4% on average across 38 tasks, given a candidate pool of 128M image-text pairs. Our best approach is also 2x better at Flickr and MS-COCO retrieval. We then analyze what makes synthetic captions an effective source of text supervision. In experimenting with different image captioning models, we also demonstrate…
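One way to picture a mixing strategy for raw and generated captions is the minimal sketch below. It is an illustration under stated assumptions, not the authors' implementation: the `Pair` record, the `raw_score` field (a precomputed image-text similarity for the raw caption), the `mix_captions` helper, and the `keep_fraction` value are all hypothetical and do not come from the text above. The sketch keeps the raw caption for the highest-scoring pairs and substitutes the generated caption everywhere else, so no image is dropped from the pool.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    """One web-scraped datapoint (hypothetical record layout)."""
    image_id: str
    raw_caption: str        # original alt-text from the web
    generated_caption: str  # output of some image captioning model
    raw_score: float        # assumed image-text similarity of the raw caption


def mix_captions(pairs, keep_fraction=0.5):
    """Return {image_id: caption}, keeping raw captions for the top-scoring pairs.

    keep_fraction is an illustrative knob, not a value reported in the source.
    """
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    # Rank by how well the raw caption matches its image, best first.
    ranked = sorted(pairs, key=lambda p: p.raw_score, reverse=True)
    n_keep = int(len(ranked) * keep_fraction)
    keep_ids = {p.image_id for p in ranked[:n_keep]}
    # Every image stays in the dataset; only the text supervision changes.
    return {
        p.image_id: p.raw_caption if p.image_id in keep_ids else p.generated_caption
        for p in pairs
    }


if __name__ == "__main__":
    demo = [
        Pair("a", "IMG_0042.jpg", "a dog running on a beach", 0.10),
        Pair("b", "golden retriever playing fetch", "a dog with a ball", 0.35),
    ]
    print(mix_captions(demo, keep_fraction=0.5))
```

Unlike filtering, which discards low-scoring pairs outright, this kind of substitution keeps every image and only swaps its text, which is one way to avoid trading data diversity for caption quality.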
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques
Methods: Contrastive Language-Image Pre-training
