mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou

TL;DR
mPLUG-Owl3 is a versatile multi-modal large language model designed to understand long image sequences, videos, and complex multi-image scenarios by integrating vision and language into a unified semantic space, achieving state-of-the-art results among models of similar size.
Contribution
The paper introduces novel hyper attention blocks for efficient vision-language integration, enabling long sequence understanding in multi-modal large language models.
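To make the idea of attention-based vision-language integration concrete, here is a minimal sketch of generic cross-attention fusion, in which each text vector attends over a set of image vectors. It is an illustration under stated assumptions, not the paper's hyper attention block: the function names `cross_attend` and `fuse`, the scalar `gate`, and the absence of learned projections are all hypothetical simplifications.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text_states, image_feats):
    """Scaled dot-product attention: text vectors are queries,
    image vectors serve as both keys and values (no learned
    projections; an assumption made to keep the sketch short)."""
    dim = len(text_states[0])
    out = []
    for q in text_states:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in image_feats
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[i] for w, v in zip(weights, image_feats))
            for i in range(dim)
        ])
    return out


def fuse(text_states, image_feats, gate):
    """Blend each text vector with its visual context.
    `gate` is a hypothetical scalar in [0, 1]; 0 keeps the text
    stream unchanged, 1 replaces it with the attended image context."""
    visual = cross_attend(text_states, image_feats)
    return [
        [(1 - gate) * t + gate * v for t, v in zip(ts, vs)]
        for ts, vs in zip(text_states, visual)
    ]


# Two text vectors attending over three image vectors.
text = [[1.0, 0.0], [0.0, 1.0]]
images = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused = fuse(text, images, gate=0.5)
print(len(fused), len(fused[0]))
```

In this kind of fusion the output keeps the length of the text sequence however many image vectors are supplied, which is one reason cross-attention is a common choice when the visual input is long.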
Findings
Achieves state-of-the-art performance among models of similar size on single-image, multi-image, and video benchmarks.
Excels in ultra-long visual sequence processing.
Demonstrates robustness in distractor-rich environments.
Abstract
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover,…
Code & Models
- 🤗 mPLUG/mPLUG-Owl3-7B-240728 (model): 1.3k downloads, 43 likes
- 🤗 mPLUG/mPLUG-Owl3-2B-241014 (model): 101 downloads, 6 likes
- 🤗 mPLUG/mPLUG-Owl3-1B-241014 (model): 7 downloads, 2 likes
- 🤗 mPLUG/mPLUG-Owl3-7B-241101 (model): 91 downloads, 10 likes
- 🤗 assentian1970/mplug3_dsd (model): 2 downloads
- 🤗 assentian1970/mPLUG-Owl3-7B-241101-API (model): 1 download
- 🤗 iqa-project/mPLUG-Owl3-7B-241101 (model): 7 downloads
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Softmax · Attention Is All You Need · Focus
