mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

Haiyang Xu; Qinghao Ye; Ming Yan; Yaya Shi; Jiabo Ye; Yuanhong Xu; Chenliang Li; Bin Bi; Qi Qian; Wei Wang; Guohai Xu; Ji Zhang; Songfang Huang; Fei Huang; Jingren Zhou

arXiv:2302.00402 · cs.CV · May 12, 2023 · 50 cites

TL;DR

mPLUG-2 is a modular multi-modal foundation model that covers understanding and generation tasks across text, image, and video. It reports state-of-the-art or competitive results on a broad range of tasks while letting each task use only the modules it needs.

Contribution

The paper proposes a modularized design for multi-modal pretraining. Shared universal modules support collaboration between modalities, separate modality-specific modules address modality entanglement, and each task selects the modules it requires.
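
To make the idea of task-specific module selection concrete, here is a minimal Python sketch. It is not the authors' implementation: the module names, the task recipes, and the `make_stub`, `compose`, and `run` helpers are all hypothetical placeholders, and each "module" is a stub that only records its name instead of running a neural network. The sketch only shows the composition pattern, in which one shared module appears in every task pipeline while modality-specific modules are swapped in and out.

```python
from typing import Callable, Dict, List

# A "module" here is a stand-in for a neural network block. It takes a trace
# of the modules applied so far and returns the extended trace.
Module = Callable[[List[str]], List[str]]


def make_stub(name: str) -> Module:
    """Return a stub module that appends its own name to the trace."""
    def apply(trace: List[str]) -> List[str]:
        return trace + [name]
    return apply


# Hypothetical module names, chosen for illustration only.
REGISTRY: Dict[str, Module] = {
    name: make_stub(name)
    for name in ("text_encoder", "image_encoder", "video_encoder",
                 "universal", "fusion", "decoder")
}

# Hypothetical task recipes: every task reuses the shared "universal" module
# and adds only the modality-specific modules it needs.
TASK_RECIPES: Dict[str, List[str]] = {
    "text_classification": ["text_encoder", "universal"],
    "image_captioning": ["image_encoder", "universal", "fusion", "decoder"],
    "video_qa": ["video_encoder", "text_encoder", "universal", "fusion", "decoder"],
}


def compose(task: str) -> List[Module]:
    """Look up the modules a task needs, in the order given by its recipe."""
    return [REGISTRY[name] for name in TASK_RECIPES[task]]


def run(task: str) -> List[str]:
    """Apply a task's modules in sequence and return the resulting trace."""
    trace: List[str] = []
    for module in compose(task):
        trace = module(trace)
    return trace


if __name__ == "__main__":
    for task in TASK_RECIPES:
        print(task, "->", " -> ".join(run(task)))
```

In this toy setup, a text-only task never touches the vision stubs or the decoder, which is the kind of flexibility the modular design is described as offering.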

Findings

01. Achieves state-of-the-art results on MSRVTT video QA and video captioning.

02. Demonstrates strong zero-shot transferability across tasks.

03. Does so with a smaller model size and data scale.

Abstract

Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized design for multi-modal pretraining, which can benefit from modality collaboration while addressing the problem of modality entanglement. In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement. It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30…

Videos

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning