InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li

TL;DR
InternVL3 introduces a unified multimodal pre-training approach that jointly learns visual and linguistic capabilities, achieving state-of-the-art results among open-source MLLMs while maintaining strong language skills.
Contribution
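It presents a native multimodal pre-training paradigm, combined with variable visual position encoding (V2PE), supervised fine-tuning (SFT), mixed preference optimization (MPO), and test-time scaling, that improves scalability, performance, and alignment over conventional post-hoc adaptation of text-only LLMs.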
Findings
InternVL3-78B scores 72.2 on the MMMU benchmark, a new state of the art among open-source MLLMs.
Performs competitively with proprietary models such as ChatGPT-4o and Claude 3.5.
Maintains strong pure-language proficiency while excelling at multimodal tasks.
Abstract
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling…
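The abstract names V2PE but does not spell out how it works. The following is a minimal sketch of the general idea, assuming that text tokens advance the position index by 1 while visual tokens advance it by a smaller increment `delta`, so that long image or video sequences consume less of the position range. The function name `v2pe_position_ids`, the `is_visual` mask, and the default `delta` are hypothetical illustrations, not the authors' implementation.

```python
def v2pe_position_ids(is_visual, delta=0.25):
    """Return one position index per token (illustrative sketch only).

    is_visual: sequence of booleans, True where the token is a visual token.
    delta:     assumed position increment for visual tokens (0 < delta <= 1).
    """
    positions = []
    current = 0.0
    for i, visual in enumerate(is_visual):
        if i > 0:
            # Text tokens step by 1; visual tokens step by the smaller delta.
            current += delta if visual else 1.0
        positions.append(current)
    return positions


# Example: one text token, two visual tokens, one text token.
example = v2pe_position_ids([False, True, True, False], delta=0.25)
```

With `delta=1.0` the sketch reduces to ordinary consecutive position indices, which is one way to see that the scheme only changes how visual tokens are spaced.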
Code & Models
- 🤗 OpenGVLab/InternVL3_5-38B · model · 7.5k dl · ♡ 43
- 🤗 OpenGVLab/InternVL3_5-8B · model · 46k dl · ♡ 96
- 🤗 OpenGVLab/InternVL3-78B · model · 40k dl · ♡ 233
- 🤗 OpenGVLab/InternVL3_5-241B-A28B · model · 430 dl · ♡ 136
- 🤗 OpenGVLab/InternVL3_5-30B-A3B · model · 109k dl · ♡ 42
- 🤗 OpenGVLab/InternVL3_5-38B-Instruct · model · 1.2k dl · ♡ 6
- 🤗 OpenGVLab/InternVL3-38B · model · 76k dl · ♡ 43
- 🤗 OpenGVLab/InternVL3-14B · model · 44k dl · ♡ 79
- 🤗 OpenGVLab/InternVL3-8B · model · 110k dl · ♡ 103
- 🤗 OpenGVLab/InternVL3-9B · model · 8.5k dl · ♡ 25
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
