WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models
Sha Yuan, Shuai Zhao, Jiahong Leng, Zhao Xue, Hanyu Zhao, Peiyu Liu, Zheng Gong, Wayne Xin Zhao, Junyi Li, Jie Tang

TL;DR
WuDaoMM is a large-scale multi-modal dataset with over 650 million image-text pairs, designed to enhance vision-language pre-training models, especially for text-to-image generation tasks.
Contribution
The paper introduces WuDaoMM, a comprehensive multi-modal dataset with diverse data sources and sizes, supporting improved pre-training of vision-language models.
Findings
WuDaoMM improves model performance on downstream tasks.
The dataset enhances text-to-image generation quality.
Training with WuDaoMM accelerates model convergence.
Abstract
Compared with domain-specific models, vision-language pre-training models (VLPMs) have shown superior performance on downstream tasks with a fast fine-tuning process. For example, ERNIE-ViL, Oscar and UNIMO trained VLPMs with a uniform transformer stack architecture and large amounts of image-text paired data, achieving remarkable results on downstream tasks such as image-text retrieval (IR and TR), visual question answering (VQA) and image captioning (IC). During the training phase, VLPMs are always fed a combination of multiple public datasets to meet the demand for large-scale training data. However, because these datasets differ in size, task type and quality, training a model on their mixture can be problematic. In this work, we introduce a large-scale multi-modal corpus named WuDaoMM, containing more than 650M…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques
Methods: Balanced Selection · Crossmodal Contrastive Learning · UNIMO · OSCAR
