WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
Yuqi Huo, Manli Zhang, Guangzhen Liu, Haoyu Lu, Yizhao Gao, Guoxing Yang, Jingyuan Wen, Heng Zhang, Baogui Xu, Weihao Zheng, Zongzheng Xi, Yueqian Yang, Anwen Hu, Jinming Zhao, Ruichen Li, Yida Zhao, Liang Zhang, Yuqing Song, Xin Hong, Wanqing Cui, Danyang Hou, Yingyan Li

TL;DR
WenLan introduces BriVL, a large-scale multi-modal pre-training model that implicitly models cross-modal correlations within a contrastive learning framework, improving vision-language understanding, especially in Chinese contexts.
Contribution
The paper presents BriVL, a novel two-tower model with an advanced MoCo-based contrastive learning algorithm for large-scale Chinese multi-modal pre-training.
Findings
BriVL outperforms UNITER and CLIP on downstream tasks.
The authors construct RUC-CAS-WenLan, a large Chinese multi-source image-text dataset.
The approach supports effective large-scale pre-training with limited GPU resources.
Abstract
Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic correlation between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project 'WenLan' led by our team. Specifically, with the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP that adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario. By…
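The abstract names three ingredients: a two-tower model, a cross-modal contrastive objective, and a MoCo-style adaptation. The following is a minimal sketch of how those pieces usually fit together, not the authors' implementation. It assumes the InfoNCE loss, a momentum-updated key encoder, and a fixed-size queue of negative keys as in standard MoCo. All function names, the temperature of 0.07, and the momentum of 0.999 are illustrative assumptions and are not taken from this page.

```python
import math
from collections import deque


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for one query embedding.

    The query comes from one tower (e.g. image), the keys from the other
    tower (e.g. text). The loss is the negative log-softmax of the positive
    logit among all logits. The temperature is an assumed value.
    """
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(k)) / temperature for k in negative_keys]
    # Numerically stable log-sum-exp.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[0]


def momentum_update(key_params, query_params, m=0.999):
    """MoCo-style update: key encoder slowly tracks the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def enqueue(queue, new_keys):
    """Push new keys; a deque with maxlen silently drops the oldest ones."""
    queue.extend(new_keys)


# Toy usage with hypothetical 2-d embeddings.
image_emb = [1.0, 0.0]
matching_text = [0.9, 0.1]
text_queue = deque([[0.0, 1.0], [-1.0, 0.0]], maxlen=4)  # negatives

loss_matched = info_nce(image_emb, matching_text, list(text_queue))
loss_mismatched = info_nce(image_emb, [0.0, 1.0], [matching_text, [-1.0, 0.0]])
print(loss_matched < loss_mismatched)
```

In this kind of setup the queue lets the number of negatives exceed the batch size, because keys computed in earlier steps are reused rather than recomputed.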
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Learning · WenLan · UNiversal Image-TExt Representation Learning · Contrastive Language-Image Pre-training · InfoNCE · Batch Normalization · Momentum Contrast
