UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

Wei Li; Can Gao; Guocheng Niu; Xinyan Xiao; Hao Liu; Jiachen Liu; Hua Wu; Haifeng Wang

arXiv:2012.15409 · cs.CL · March 15, 2022 · 32 cites

TL;DR

UNIMO is a unified pre-training model that effectively integrates and leverages large-scale single-modal and multi-modal data through cross-modal contrastive learning to enhance understanding and generation tasks across modalities.

Contribution

The paper introduces a unified-modal pre-training architecture that adapts to both single-modal and multi-modal tasks, using large-scale data and cross-modal contrastive learning to improve performance.

Findings

01. Significantly improves performance on various downstream tasks.
02. Effectively aligns visual and textual information into a shared semantic space.
03. Utilizes large-scale unpaired data for more generalizable representations.

Abstract

Existing pre-training methods focus on either single-modal or multi-modal tasks and cannot effectively adapt to each other. They can only utilize single-modal data (i.e., text or images) or limited multi-modal data (i.e., image-text pairs). In this work, we propose a unified-modal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large-scale free text corpora and image collections are utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align textual and visual information into a unified semantic space over a corpus of image-text pairs. As non-paired single-modal data is very rich, our model can utilize a much larger scale of data to learn more generalizable representations. Moreover, the textual…
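
The idea of aligning textual and visual information in one semantic space with a contrastive objective can be illustrated with a minimal sketch. This is a generic symmetric InfoNCE-style loss with in-batch negatives, not necessarily the exact CMCL objective used in UNIMO. The function names, the temperature value, and the toy 2-D embeddings are all illustrative assumptions; in practice the embeddings would come from the model's encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_embs[k] and text_embs[k] are treated as a positive pair;
    every other combination in the batch serves as a negative.
    The temperature is a hypothetical default, not a value from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # sim[i][j]: temperature-scaled cosine similarity of image i and text j.
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
           for img in imgs]

    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n

    # Text-to-image direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2i = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n

    return 0.5 * (i2t + t2i)


# Toy batch: two image embeddings and two text embeddings.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[0.9, 0.1], [0.1, 0.9]])
shuffled = contrastive_loss(images, [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)
```

The loss is lower when each image embedding is closest to its own paired text, which is the behaviour that pulls matched image-text pairs together in the shared space and pushes mismatched pairs apart.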

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Learning · Crossmodal Contrastive Learning · UNIMO