WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

Yuqi Huo; Manli Zhang; Guangzhen Liu; Haoyu Lu; Yizhao Gao; Guoxing Yang; Jingyuan Wen; Heng Zhang; Baogui Xu; Weihao Zheng; Zongzheng Xi; Yueqian Yang; Anwen Hu; Jinming Zhao; Ruichen Li; Yida Zhao; Liang Zhang; Yuqing Song; Xin Hong; Wanqing Cui; Danyang Hou; Yingyan Li; Junyi Li; Peiyu Liu; Zheng Gong; Chuhao Jin; Yuchong Sun; Shizhe Chen; Zhiwu Lu; Zhicheng Dou; Qin Jin; Yanyan Lan; Wayne Xin Zhao; Ruihua Song; and Ji-Rong Wen

arXiv:2103.06561 · cs.CV · July 9, 2021 · 85 cites

TL;DR

WenLan introduces BriVL, a large-scale multi-modal pre-training model that implicitly models cross-modal correlations using a contrastive learning framework, improving vision-language understanding especially in Chinese contexts.

Contribution

The paper presents BriVL, a novel two-tower model with an advanced MoCo-based contrastive learning algorithm for large-scale Chinese multi-modal pre-training.

Findings

01 BriVL outperforms UNITER and CLIP on downstream tasks.

02 The authors construct RUC-CAS-WenLan, a large Chinese multi-source image-text dataset.

03 Large-scale pre-training is carried out effectively with limited GPU resources.

Abstract

Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, assuming that a strong semantic correlation exists between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project 'WenLan' led by our team. Specifically, with the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario. By…
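To make the idea of adapting MoCo to a two-tower setting more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes one plausible arrangement: each modality has a query encoder and a momentum-updated key encoder, negatives come from a fixed-size queue of past key embeddings, and the loss is InfoNCE in both directions. The linear "towers", the dimensions, the temperature of 0.07, the momentum of 0.99, and all function names are hypothetical placeholders chosen for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def info_nce_with_queue(queries, positive_keys, queue, temperature=0.07):
    """InfoNCE loss: one positive key per query, negatives taken from a queue."""
    q = l2_normalize(queries)
    k = l2_normalize(positive_keys)
    neg = l2_normalize(queue)
    pos_logits = np.sum(q * k, axis=1, keepdims=True)  # shape (B, 1)
    neg_logits = q @ neg.T                             # shape (B, K)
    logits = np.concatenate([pos_logits, neg_logits], axis=1) / temperature
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_prob_pos.mean())


def momentum_update(key_params, query_params, m=0.99):
    """Move key-encoder weights slowly toward the query-encoder weights."""
    return m * key_params + (1.0 - m) * query_params


def enqueue(queue, new_keys):
    """FIFO update: drop the oldest keys, append the newest.

    Assumes len(new_keys) <= len(queue) so the queue size stays constant.
    """
    return np.concatenate([queue[len(new_keys):], new_keys], axis=0)


# Toy setup (all sizes are illustrative assumptions).
rng = np.random.default_rng(0)
B, K, D_IMG, D_TXT, D = 4, 16, 16, 12, 8
img_feats = rng.normal(size=(B, D_IMG))   # stand-in for image backbone output
txt_feats = rng.normal(size=(B, D_TXT))   # stand-in for text backbone output

# Hypothetical linear towers: query encoders and their momentum (key) copies.
W_img_q = rng.normal(size=(D_IMG, D))
W_txt_q = rng.normal(size=(D_TXT, D))
W_img_k, W_txt_k = W_img_q.copy(), W_txt_q.copy()

# Queues of past key embeddings that serve as negatives.
img_queue = l2_normalize(rng.normal(size=(K, D)))
txt_queue = l2_normalize(rng.normal(size=(K, D)))

# Image-to-text: image queries against text keys and queued text negatives.
img_q, txt_k = img_feats @ W_img_q, txt_feats @ W_txt_k
loss_i2t = info_nce_with_queue(img_q, txt_k, txt_queue)

# Text-to-image: text queries against image keys and queued image negatives.
txt_q, img_k = txt_feats @ W_txt_q, img_feats @ W_img_k
loss_t2i = info_nce_with_queue(txt_q, img_k, img_queue)

total_loss = loss_i2t + loss_t2i

# The gradient step on the query towers is omitted here. Afterwards the key
# towers follow by momentum and the queues absorb the current batch of keys.
W_img_k = momentum_update(W_img_k, W_img_q)
W_txt_k = momentum_update(W_txt_k, W_txt_q)
img_queue = enqueue(img_queue, l2_normalize(img_k))
txt_queue = enqueue(txt_queue, l2_normalize(txt_k))
```

In this sketch the number of negatives is set by the queue length K rather than by the batch size B, which is one way a queue-based scheme can keep many negatives available without a very large batch.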

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Contrastive Learning · WenLan · UNiversal Image-TExt Representation Learning · Contrastive Language-Image Pre-training · InfoNCE · Batch Normalization · Momentum Contrast