Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark

Jiaxi Gu; Xiaojun Meng; Guansong Lu; Lu Hou; Minzhe Niu; Xiaodan Liang; Lewei Yao; Runhui Huang; Wei Zhang; Xin Jiang; Chunjing Xu; Hang Xu

arXiv:2202.06767 · cs.CV · September 30, 2022 · 29 cites

TL;DR

Wukong introduces a large-scale Chinese cross-modal dataset with 100 million image-text pairs, enabling benchmarking and advancing Chinese vision-language pre-training models with state-of-the-art results on multiple tasks.

Contribution

This work provides the first large-scale Chinese cross-modal dataset and benchmark, along with pre-trained models and techniques to enhance Chinese VLP research.

Findings

01. Wukong achieves 73.03% average accuracy on zero-shot image classification.

02. On AIC-ICC, its recall is 12.9% higher than that of previous models.

03. The pre-trained models are effective across multiple downstream datasets.

Abstract

Vision-Language Pre-training (VLP) models have shown remarkable performance on various downstream tasks. Their success heavily relies on the scale of pre-trained cross-modal datasets. However, the lack of large-scale datasets and benchmarks in Chinese hinders the development of Chinese VLP models and broader multilingual applications. In this work, we release a large-scale Chinese cross-modal dataset named Wukong, which contains 100 million Chinese image-text pairs collected from the web. Wukong aims to benchmark different multi-modal pre-training methods to facilitate VLP research and community development. Furthermore, we release a group of models pre-trained with various image encoders (ViT-B/ViT-L/SwinT) and also apply advanced pre-training techniques to VLP, such as locked-image text tuning, token-wise similarity in contrastive learning, and reduced-token interaction.…
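
Of the techniques named in the abstract, token-wise similarity in contrastive learning is the easiest to illustrate in isolation. The sketch below is not the authors' implementation. It assumes a late-interaction formulation in which each image patch is matched to its most similar text token (and each text token to its most similar patch), and the per-token maxima are averaged to score an image-text pair. The function names, the temperature value, and the absence of padding masks are all illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def token_wise_similarity(image_tokens, text_tokens):
    """Hypothetical helper: token-wise similarity for a batch.

    image_tokens: (B, P, D) patch embeddings; text_tokens: (B, T, D) token embeddings.
    Returns two (B, B) matrices indexed [image, text].
    """
    img = l2_normalize(image_tokens)
    txt = l2_normalize(text_tokens)
    # sim[i, j, p, t]: cosine similarity of patch p in image i and token t in text j.
    sim = np.einsum("ipd,jtd->ijpt", img, txt)
    # Image-to-text: best text token per patch, averaged over patches.
    i2t = sim.max(axis=3).mean(axis=2)
    # Text-to-image: best patch per text token, averaged over text tokens.
    t2i = sim.max(axis=2).mean(axis=2)
    return i2t, t2i


def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_loss(i2t, t2i, temperature=0.07):
    """Symmetric cross-entropy loss; matched pairs sit on the diagonal.

    The temperature is an assumed value, not one taken from the paper.
    """
    idx = np.arange(i2t.shape[0])
    loss_i = -log_softmax(i2t / temperature, axis=1)[idx, idx].mean()
    # Transpose so that each row holds one text scored against all images.
    loss_t = -log_softmax(t2i.T / temperature, axis=1)[idx, idx].mean()
    return 0.5 * (loss_i + loss_t)


rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(4, 5, 8))  # 4 images, 5 patches, dimension 8
text_tokens = rng.normal(size=(4, 6, 8))   # 4 texts, 6 tokens, dimension 8
i2t, t2i = token_wise_similarity(image_tokens, text_tokens)
loss = contrastive_loss(i2t, t2i)
```

A real pipeline would also mask padded text tokens before taking the maxima; that step is left out here to keep the sketch short.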

Code & Models

Repositories

0jason000/wukong
mindspore

Videos

Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark · slideslive

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: WenLan · Contrastive Language-Image Pre-training