mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

Qinghao Ye; Haiyang Xu; Guohai Xu; Jiabo Ye; Ming Yan; Yiyang Zhou; Junyang Wang; Anwen Hu; Pengcheng Shi; Yaya Shi; Chenliang Li; Yuanhong Xu; Hehong Chen; Junfeng Tian; Qi Qian; Ji Zhang; Fei Huang; Jingren Zhou

arXiv:2304.14178 · cs.CL · April 1, 2024 · 168 cites

TL;DR

mPLUG-Owl introduces a modular training paradigm that gives large language models multi-modal capabilities such as visual understanding and reasoning. The resulting model outperforms existing multi-modal models in instruction following and multi-turn conversation.

Contribution

The paper presents a novel two-stage training method that modularly integrates visual modules with LLMs, enhancing multi-modal abilities without compromising language generation.

Findings

01 Outperforms existing multi-modal models in instruction and visual understanding.

02 Demonstrates multi-turn conversation and knowledge reasoning abilities.

03 Exhibits unexpected skills like multi-image correlation and scene text understanding.

Abstract

Large language models (LLMs) have demonstrated impressive zero-shot abilities on a variety of open-ended tasks, while recent research has also explored the use of LLMs for multi-modal generation. In this study, we introduce mPLUG-Owl, a novel training paradigm that equips LLMs with multi-modal abilities through modularized learning of a foundation LLM, a visual knowledge module, and a visual abstractor module. This approach can support multiple modalities and facilitate diverse unimodal and multimodal abilities through modality collaboration. The training paradigm of mPLUG-Owl involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM while maintaining and even improving the generation abilities of the LLM. In the first stage, the visual knowledge module and abstractor module are trained with a frozen LLM module to align the image and…
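
The first stage described above (visual modules trained while the LLM stays frozen) can be illustrated with a toy sketch. This is not the authors' implementation: the three module names come from the abstract, but the `Module` class, the helper functions, the weights, the gradients, and the learning rate are all hypothetical placeholders chosen only to show how freezing one module leaves its parameters untouched during an update. Only the first stage is sketched, because the visible abstract is cut off before the second stage is described.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Toy stand-in for a network module: a name, a few weights, a freeze flag."""
    name: str
    weights: list
    trainable: bool = True


def build_modules():
    # Hypothetical toy weights; real modules hold millions of parameters.
    return {
        "llm": Module("llm", [1.0, 2.0]),
        "visual_knowledge": Module("visual_knowledge", [0.5, 0.5]),
        "visual_abstractor": Module("visual_abstractor", [0.1, 0.2]),
    }


def configure_stage_one(modules):
    """Stage one as described in the abstract: freeze the LLM and
    leave the visual knowledge module and visual abstractor trainable."""
    for module in modules.values():
        module.trainable = module.name != "llm"


def sgd_step(modules, grads, lr=0.1):
    """Apply a plain gradient step, skipping every frozen module."""
    for name, module in modules.items():
        if not module.trainable:
            continue
        module.weights = [w - lr * g for w, g in zip(module.weights, grads[name])]


if __name__ == "__main__":
    modules = build_modules()
    configure_stage_one(modules)
    # Placeholder gradients of 1.0 for every weight.
    grads = {name: [1.0] * len(m.weights) for name, m in modules.items()}
    sgd_step(modules, grads)
    for name, module in modules.items():
        print(name, module.trainable, module.weights)
```

In a real framework the same idea is usually expressed by disabling gradient tracking on the frozen module's parameters and passing only the trainable parameters to the optimizer.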

Code & Models

Repositories

x-plug/mplug-owl
PyTorch · Official

Models

🤗 0xDing/yuren-baichuan-7b
model · 9 dl · ♡ 27

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: ALIGN