InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu; Weiyun Wang; Zhe Chen; Zhaoyang Liu; Shenglong Ye; Lixin Gu; Hao Tian; Yuchen Duan; Weijie Su; Jie Shao; Zhangwei Gao; Erfei Cui; Xuehui Wang; Yue Cao; Yangzhou Liu; Xingguang Wei; Hongjie Zhang; Haomin Wang; Weiye Xu; Hao Li; Jiahao Wang; Nianchen Deng; Songze Li; Yinan He; Tan Jiang; Jiapeng Luo; Yi Wang; Conghui He; Botian Shi; Xingcheng Zhang; Wenqi Shao; Junjun He; Yingtong Xiong; Wenwen Qu; Peng Sun; Penglong Jiao; Han Lv; Lijun Wu; Kaipeng Zhang; Huipeng Deng; Jiaye Ge; Kai Chen; Limin Wang; Min Dou; Lewei Lu; Xizhou Zhu; Tong Lu; Dahua Lin; Yu Qiao; Jifeng Dai; Wenhai Wang

arXiv:2504.10479 · cs.CV · April 22, 2025 · 5 citations

Open Access · 1 Repo · 10 Models · 1 Dataset

TL;DR

InternVL3 introduces a unified multimodal pre-training approach that learns visual and linguistic capabilities jointly, achieving state-of-the-art results among open-source MLLMs while retaining strong pure-language skills.

Contribution

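The paper presents a native multimodal pre-training paradigm, complemented by variable visual position encoding (V2PE), supervised fine-tuning (SFT), mixed preference optimization (MPO), and test-time scaling, to improve scalability, performance, and alignment over conventional post-hoc adaptation of text-only LLMs.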

Findings

01

InternVL3-78B scores 72.2 on the MMMU benchmark, setting a new state of the art among open-source MLLMs.

02

Performs competitively with proprietary models such as ChatGPT-4o and Claude 3.5.

03

Maintains strong pure-language proficiency while excelling in multimodal tasks.

Abstract

We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling…
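The abstract names variable visual position encoding (V2PE) without spelling out the mechanism. As a rough illustration only, the sketch below assumes the core idea is that text tokens advance the position index by 1 while visual tokens advance it by a smaller step, so long image sequences consume less of the positional range. The function name `v2pe_position_ids`, the `is_visual` mask, and the step value `delta=0.25` are hypothetical choices made for this example and are not taken from the InternVL3 code.

```python
from typing import List


def v2pe_position_ids(is_visual: List[bool], delta: float = 0.25) -> List[float]:
    """Assign a position index to each token of an interleaved sequence.

    Assumption for this sketch: a text token advances the running position
    by 1, a visual token advances it by `delta`, with 0 < delta <= 1.
    """
    if not 0 < delta <= 1:
        raise ValueError("delta must be in (0, 1]")
    positions: List[float] = []
    current = 0.0
    for visual in is_visual:
        positions.append(current)             # index of the current token
        current += delta if visual else 1.0   # smaller step after visual tokens
    return positions


# Two text tokens, four visual tokens, one text token.
mask = [False, False, True, True, True, True, False]
print(v2pe_position_ids(mask, delta=0.25))
# -> [0.0, 1.0, 2.0, 2.25, 2.5, 2.75, 3.0]
```

With `delta=1.0` the function reduces to ordinary consecutive position indices, which makes the compression effect of a smaller step easy to compare.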

Code & Models

Repositories

opengvlab/internvl
PyTorch · Official

Datasets

OpenGVLab/MMPR-v1.2-prompts
dataset · 2.1k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning