WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models

Sha Yuan; Shuai Zhao; Jiahong Leng; Zhao Xue; Hanyu Zhao; Peiyu Liu; Zheng Gong; Wayne Xin Zhao; Junyi Li; Jie Tang

arXiv:2203.11480·cs.CV·May 3, 2022·1 cites

Open Access

TL;DR

WuDaoMM is a large-scale multi-modal dataset with over 650 million image-text pairs, designed to enhance vision-language pre-training models, especially for text-to-image generation tasks.

Contribution

The paper introduces WuDaoMM, a comprehensive multi-modal dataset with diverse data sources and sizes, supporting improved pre-training of vision-language models.

Findings

01. WuDaoMM improves model performance on downstream tasks.
02. The dataset enhances text-to-image generation quality.
03. Training with WuDaoMM accelerates model convergence.

Abstract

Compared with domain-specific models, vision-language pre-training models (VLPMs) have shown superior performance on downstream tasks with a fast fine-tuning process. For example, ERNIE-ViL, Oscar and UNIMO trained VLPMs with a uniform transformer stack architecture and large amounts of image-text paired data, achieving remarkable results on downstream tasks such as image-text retrieval (IR and TR), visual question answering (VQA) and image captioning (IC). During the training phase, VLPMs are always fed a combination of multiple public datasets to meet the demand for large-scale training data. However, because these datasets differ in size, task type and quality, training a model on such a mixture can be problematic. In this work, we introduce a large-scale multi-modal corpus named WuDaoMM, containing in total more than 650M…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques

Methods: Balanced Selection · Crossmodal Contrastive Learning · UNIMO · OSCAR