Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models

Gen Luo; Wenhan Dou; Wenhao Li; Zhaokai Wang; Xue Yang; Changyao Tian; Hao Li; Weiyun Wang; Wenhai Wang; Xizhou Zhu; Yu Qiao; Jifeng Dai

arXiv:2507.12566 · cs.CV · July 18, 2025

TL;DR

This paper introduces Mono-InternVL-1.5, a monolithic multimodal large language model (MLLM) designed to lower training and inference cost while maintaining performance, through improved pre-training, expert organization, and inference acceleration.

Contribution

It presents Mono-InternVL-1.5, a novel monolithic MLLM with enhanced efficiency, stability, and performance, achieved via improved pre-training, expert organization, and inference acceleration techniques.

Findings

01. Outperforms existing MLLMs on 12 of 15 benchmarks
02. Reduces training and inference costs significantly
03. Achieves up to 69% latency reduction compared to modular models

Abstract

This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts architecture. In addition, we design an innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs but also leads to…
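The idea of adding visual experts to a pre-trained LLM and training only the new parameters can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the names `Expert` and `ModalityRoutedLayer` are hypothetical, the experts are simple scale-and-shift functions rather than real feed-forward networks, and routing each token by a fixed modality flag is one plausible reading of "multimodal mixture-of-experts", not a confirmed description of the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence

Vector = List[float]


@dataclass
class Expert:
    """Toy per-token expert (hypothetical): scale-and-shift instead of a real FFN."""
    scale: float
    bias: float
    trainable: bool

    def __call__(self, x: Sequence[float]) -> Vector:
        return [self.scale * v + self.bias for v in x]


@dataclass
class ModalityRoutedLayer:
    """Sends text tokens to the original expert and visual tokens to the added one."""
    text_expert: Expert    # stands in for the pre-trained LLM weights (assumed frozen)
    visual_expert: Expert  # newly added visual parameters (assumed trainable)

    def forward(self, tokens: Sequence[Sequence[float]],
                is_visual: Sequence[bool]) -> List[Vector]:
        if len(tokens) != len(is_visual):
            raise ValueError("tokens and is_visual must have the same length")
        # Static routing by modality flag: an assumption, not a learned router.
        return [
            (self.visual_expert if visual else self.text_expert)(tok)
            for tok, visual in zip(tokens, is_visual)
        ]

    def trainable_experts(self) -> List[Expert]:
        # Only experts marked trainable would receive updates during tuning.
        return [e for e in (self.text_expert, self.visual_expert) if e.trainable]


layer = ModalityRoutedLayer(
    text_expert=Expert(scale=1.0, bias=0.0, trainable=False),
    visual_expert=Expert(scale=2.0, bias=1.0, trainable=True),
)
# First token is text, second is visual.
outputs = layer.forward([[1.0, 2.0], [3.0, 4.0]], [False, True])
print(outputs)  # [[1.0, 2.0], [7.0, 9.0]]
```

In this sketch the text path is left untouched, so whatever the language model already does for text tokens is preserved by construction, which is the intuition behind avoiding catastrophic forgetting.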

Code & Models

Datasets

OpenGVLab/Mono-InternVL-2B-Synthetic-Data
dataset · 60 downloads

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis