UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning

Hao Yang; Can Gao; Hao Líu; Xinyan Xiao; Yanyan Zhao; Bing Qin

arXiv:2305.13697·cs.CL·May 24, 2023·1 cites

Open Access

TL;DR

UNIMO-3 introduces a multi-granularity interaction model for vision-language pre-training, enabling effective in-layer and cross-layer multimodal interactions, leading to state-of-the-art performance on downstream tasks.

Contribution

Proposes UNIMO-3, a model that captures multi-level cross-modal interactions through cross-layer connections to improve multimodal representation learning.

Findings

01 Achieves state-of-the-art results on various downstream tasks.

02 Cross-layer learning improves multimodal representation.

03 Effective in-layer and cross-layer interactions are crucial for performance.

Abstract

Vision-and-language (VL) pre-training aims to learn a general representation of image-text pairs that can be transferred to various vision-and-language tasks. Compared with modeling uni-modal data, the main challenge for a VL model is how to learn cross-modal interaction from multimodal data, especially fine-grained interaction. Existing work has shown that fully transformer-based models that adopt attention mechanisms to learn in-layer cross-modal interaction can achieve impressive performance on various cross-modal downstream tasks. However, these models overlook the fact that the semantic information of different modalities at the same layer is not uniform, which causes the cross-modal interaction to collapse into a limited exchange of multimodal semantic information. In this work, we propose the UNIMO-3 model, which has the capacity to simultaneously learn the multimodal…
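
The contrast the abstract draws between in-layer and cross-layer interaction can be illustrated with a small sketch. This is not the UNIMO-3 architecture, whose details the truncated abstract does not give. It is a minimal illustration under stated assumptions: plain scaled dot-product attention, text tokens as queries, and cross-layer interaction approximated by concatenating image tokens from several layers. All function names (`attend`, `in_layer_interaction`, `cross_layer_interaction`, `random_tokens`) and the toy dimensions are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention on plain lists of vectors."""
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def in_layer_interaction(text_tokens, image_layers, layer):
    """Text attends only to image tokens taken from one chosen layer."""
    image_tokens = image_layers[layer]
    return attend(text_tokens, image_tokens, image_tokens)


def cross_layer_interaction(text_tokens, image_layers):
    """Text attends to image tokens from every layer.

    Assumption: layers are combined by simple concatenation; the paper's
    actual cross-layer mechanism may differ.
    """
    pooled = [tok for layer_tokens in image_layers for tok in layer_tokens]
    return attend(text_tokens, pooled, pooled)


def random_tokens(rng, count, dim):
    """Toy stand-in for encoder hidden states."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
text = random_tokens(rng, 3, 4)                              # 3 text tokens, dim 4
image_layers = [random_tokens(rng, 5, 4) for _ in range(2)]  # 2 layers, 5 tokens each

same = in_layer_interaction(text, image_layers, layer=1)
cross = cross_layer_interaction(text, image_layers)
print(len(same), len(same[0]))
print(len(cross), len(cross[0]))
```

In the in-layer case each text token can only draw on image features at a single depth, while in the cross-layer case the same query can weight features from shallow and deep layers together. That is one plausible reading of the non-uniform semantics the abstract describes.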

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning