CCMB: A Large-scale Chinese Cross-modal Benchmark

Chunyu Xie; Heng Cai; Jincheng Li; Fanjing Kong; Xiaoyu Wu; Jianfei Song; Henrique Morimitsu; Lin Yao; Dexin Wang; Xiangzheng Zhang; Dawei Leng; Baochang Zhang; Xiangyang Ji; Yafeng Deng

arXiv:2205.03860 · cs.CV · November 9, 2023 · 1 cites

TL;DR

This paper introduces CCMB, the largest Chinese cross-modal benchmark dataset, and R2D2, a novel vision-language pre-training framework, achieving state-of-the-art results across multiple Chinese vision-language tasks.

Contribution

The work provides CCMB, a large-scale Chinese cross-modal benchmark whose pre-training dataset Zero is described as the largest public one to date, and R2D2, a new VLP framework, advancing Chinese vision-language research and performance.

Findings

01 Achieved state-of-the-art results on 12 Chinese vision-language tasks.

02 Built Zero, the largest public Chinese cross-modal pre-training dataset, with 250M images and 750M texts.

03 Developed a novel pre-training framework, R2D2, with ranking and distillation strategies.

Abstract

Vision-language pre-training (VLP) on large-scale datasets has shown premier performance on various downstream tasks. In contrast to the many available benchmarks with English corpora, large-scale pre-training datasets and downstream datasets with Chinese corpora remain largely unexplored. In this work, we build a large-scale, high-quality Chinese Cross-Modal Benchmark named CCMB for the research community, which contains the currently largest public pre-training dataset, Zero, and five human-annotated fine-tuning datasets for downstream tasks. Zero contains 250 million images paired with 750 million text descriptions, and two of the five fine-tuning datasets are also currently the largest ones for Chinese cross-modal downstream tasks. Along with CCMB, we also develop a VLP framework named R2D2, applying a pre-Ranking + Ranking strategy to learn powerful vision-language representations…
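The abstract is cut off before it explains how the pre-Ranking + Ranking strategy works, so the following is only a minimal sketch of the general coarse-then-fine pattern that such a name suggests, shown here at retrieval time: a cheap embedding similarity shortlists candidates, and a costlier scorer reorders the shortlist. It is not R2D2's training procedure. Every name in it (`pre_rank`, `rank`, `toy_fine_score`, the toy vectors) is hypothetical, and the fine scorer is a stand-in for whatever joint image-text model would be used in practice.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pre_rank(query: Vector, candidates: List[Vector], k: int) -> List[int]:
    """Stage 1 (assumed): shortlist the k candidates closest to the query
    under a cheap global similarity."""
    scored = [(i, cosine(query, c)) for i, c in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:k]]


def rank(
    query: Vector,
    candidates: List[Vector],
    shortlist: List[int],
    fine_score: Callable[[Vector, Vector], float],
) -> List[Tuple[int, float]]:
    """Stage 2 (assumed): re-score only the shortlisted candidates with a
    costlier scorer and sort them by that score."""
    rescored = [(i, fine_score(query, candidates[i])) for i in shortlist]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored


def toy_fine_score(q: Vector, c: Vector) -> float:
    """Hypothetical stand-in for a fine-grained matching model:
    negative squared Euclidean distance."""
    return -sum((x - y) ** 2 for x, y in zip(q, c))


# Toy data: one text embedding and four image embeddings (made up).
text_embedding = [1.0, 0.0]
image_embeddings = [[2.0, 0.1], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

shortlist = pre_rank(text_embedding, image_embeddings, k=2)
results = rank(text_embedding, image_embeddings, shortlist, toy_fine_score)
ranked_ids = [i for i, _ in results]
```

In this toy run the first stage keeps two of the four candidates, and the second stage reverses their order, which is the point of the split: the expensive scorer is applied to a small shortlist rather than to every candidate.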

Code & Models

Repositories

yuxie11/R2D2 (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
