UNIMO-3: Multi-granularity Interaction for Vision-Language Representation Learning
Hao Yang, Can Gao, Hao Liu, Xinyan Xiao, Yanyan Zhao, Bing Qin

TL;DR
UNIMO-3 introduces a multi-granularity interaction model for vision-language pre-training, enabling effective in-layer and cross-layer multimodal interactions, leading to state-of-the-art performance on downstream tasks.
Contribution
It proposes a novel UNIMO-3 model that captures multi-level cross-modal interactions through cross-layer connections, enhancing multimodal representation learning.
Findings
Achieves state-of-the-art results on various downstream tasks.
Cross-layer learning improves multimodal representation.
Effective in-layer and cross-layer interactions are crucial for performance.
Abstract
Vision-and-language (VL) pre-training aims to learn a general representation of image-text pairs that can be transferred to various vision-and-language tasks. Compared with modeling uni-modal data, the main challenge for a VL model is how to learn cross-modal interaction from multimodal data, especially fine-grained interaction. Existing works have shown that fully transformer-based models that adopt attention mechanisms to learn in-layer cross-modal interaction can demonstrate impressive performance on various cross-modal downstream tasks. However, they ignore that the semantic information of different modalities at the same layer is not uniform, which causes the cross-modal interaction to collapse into a limited multi-modal semantic information interaction. In this work, we propose the UNIMO-3 model, which has the capacity to simultaneously learn the multimodal…
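The distinction the abstract draws between in-layer and cross-layer interaction can be illustrated with a toy sketch. The abstract is truncated before it describes the actual mechanism, so the code below is not UNIMO-3's architecture: it is a minimal NumPy illustration that assumes single-head attention without learned projections, and every name in it (`cross_attention`, `in_layer_interaction`, `cross_layer_interaction`, the tensor sizes) is hypothetical. In-layer interaction lets text tokens attend only to image features from one layer, while the cross-layer variant lets them attend to image features gathered from all layers.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product attention with no learned projections
    # (an assumption made to keep the sketch short).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values, weights

def in_layer_interaction(text_h, image_layers, layer):
    # Text tokens attend only to image features from the given layer.
    return cross_attention(text_h, image_layers[layer])

def cross_layer_interaction(text_h, image_layers):
    # Hypothetical cross-layer variant: text tokens attend to image
    # features from every layer, concatenated along the token axis.
    stacked = np.concatenate(image_layers, axis=0)
    return cross_attention(text_h, stacked)

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
num_layers, n_text, n_patches, d = 3, 4, 5, 8
text_h = rng.normal(size=(n_text, d))
image_layers = [rng.normal(size=(n_patches, d)) for _ in range(num_layers)]

in_out, in_w = in_layer_interaction(text_h, image_layers, layer=2)
cross_out, cross_w = cross_layer_interaction(text_h, image_layers)

print(in_w.shape)     # attention over one layer's patches
print(cross_w.shape)  # attention over patches from all layers
```

In this sketch the only difference between the two modes is the set of keys and values available to each text token; a real model would add learned projections, multiple heads, and some way of selecting or weighting layers.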
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
