mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou

TL;DR
mPLUG-2 introduces a modular multi-modal foundation model that effectively integrates text, image, and video understanding and generation, achieving state-of-the-art results across diverse tasks with improved flexibility and efficiency.
Contribution
Proposes a modularized design for multi-modal pretraining in which each task selects only the modules it needs, so that modalities can collaborate through shared modules while modality entanglement is handled by separate, modality-specific ones.
Findings
Achieves state-of-the-art results on MSRVTT video QA and captioning.
Demonstrates strong zero-shot transferability across tasks.
Reaches these results with a smaller model size and less pretraining data.
Abstract
Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized design for multi-modal pretraining, which can benefit from modality collaboration while addressing the problem of modality entanglement. In contrast to predominant paradigms of solely relying on sequence-to-sequence generation or encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network by sharing common universal modules for modality collaboration and disentangling different modality modules to deal with modality entanglement. It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video. Empirical study shows that mPLUG-2 achieves state-of-the-art or competitive results on a broad range of over 30…
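The module-selection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: every module name, task name, and recipe below is hypothetical and chosen only to show how one shared module can be reused across tasks while modality-specific modules are picked per task. Each "module" here just records its name instead of running a neural network.

```python
from typing import Callable, Dict, List

Trace = List[str]
Module = Callable[[Trace], Trace]


def make_module(name: str) -> Module:
    """Return a stand-in module that appends its own name to the trace."""
    def run(trace: Trace) -> Trace:
        return trace + [name]
    return run


# Hypothetical module inventory (names are assumptions, not from the paper).
SHARED = ["universal_layers"]  # reused by every task: modality collaboration
OTHER = ["text_encoder", "image_encoder", "video_encoder", "fusion", "text_decoder"]
MODULES: Dict[str, Module] = {name: make_module(name) for name in SHARED + OTHER}

# Hypothetical per-task recipes: each task composes only the modules it needs.
TASK_RECIPES: Dict[str, List[str]] = {
    "text_classification": ["text_encoder", "universal_layers"],
    "image_captioning": ["image_encoder", "universal_layers", "text_decoder"],
    "video_qa": ["video_encoder", "text_encoder", "universal_layers",
                 "fusion", "text_decoder"],
}


def run_task(task: str) -> Trace:
    """Run the modules selected for `task` in order and return the trace."""
    trace: Trace = []
    for name in TASK_RECIPES[task]:
        trace = MODULES[name](trace)
    return trace


if __name__ == "__main__":
    for task in TASK_RECIPES:
        print(task, "->", run_task(task))
```

In this toy setup, an understanding task such as text classification skips the decoder entirely, while generation tasks add it. The shared module appears in every recipe.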
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning
