InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian

TL;DR
InternVL3.5 is a family of open-source multimodal models that combines a two-stage reinforcement learning recipe (Cascade RL) with a Visual Resolution Router (ViR) and Decoupled Vision-Language Deployment (DvD) to improve reasoning, inference efficiency, and versatility, and reports state-of-the-art results on multiple benchmarks.
Contribution
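The paper presents three techniques: Cascade RL, a two-stage framework that runs offline RL and then online RL to improve reasoning; ViR, a router that dynamically adjusts the resolution of visual tokens; and DvD, a deployment strategy that places the vision encoder and the language model on different GPUs to balance computational load.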
Findings
Up to 16.0% gain in overall reasoning performance over InternVL3
Up to 4.05× faster inference than InternVL3
State-of-the-art results on multiple benchmarks
Abstract
We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively…
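The abstract describes ViR only at a high level: it adjusts the resolution of visual tokens per input so that fewer tokens are spent where they are not needed. The following is a minimal sketch of that idea, not the authors' implementation. The token budgets (256 and 64), the `detail_score` field, the fixed threshold, and every function name are assumptions made for illustration; the router in the paper may be learned rather than threshold-based.

```python
from dataclasses import dataclass

# Assumed token budgets per image patch; illustrative values only.
HIGH_RES_TOKENS = 256
LOW_RES_TOKENS = 64


@dataclass
class Patch:
    name: str
    detail_score: float  # hypothetical measure of visual detail in [0, 1]


def route_patch(patch: Patch, threshold: float = 0.5) -> int:
    """Return how many visual tokens to keep for one patch."""
    if patch.detail_score >= threshold:
        return HIGH_RES_TOKENS
    return LOW_RES_TOKENS


def route_image(patches: list[Patch], threshold: float = 0.5) -> dict[str, int]:
    """Map each patch name to its routed token count."""
    return {p.name: route_patch(p, threshold) for p in patches}


# Toy image: two detail-heavy patches and two nearly uniform ones.
patches = [
    Patch("sky", 0.1),
    Patch("street_sign", 0.9),
    Patch("wall", 0.2),
    Patch("document_text", 0.8),
]

budget = route_image(patches)
total_tokens = sum(budget.values())
baseline_tokens = HIGH_RES_TOKENS * len(patches)

print(budget)
print(f"{total_tokens} tokens after routing vs {baseline_tokens} at full resolution")
```

In this toy case the two low-detail patches are compressed, so the language model would receive 640 visual tokens instead of 1024. The savings in a real system depend on how the router is trained and on the input distribution.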
Code & Models
- 🤗 OpenGVLab/InternVL3_5-38B · model · 7.5k dl · ♡ 43
- 🤗 OpenGVLab/InternVL3_5-8B · model · 46k dl · ♡ 96
- 🤗 OpenGVLab/InternVL3_5-241B-A28B · model · 430 dl · ♡ 136
- 🤗 OpenGVLab/InternVL3_5-30B-A3B · model · 109k dl · ♡ 42
- 🤗 OpenGVLab/InternVL3_5-38B-Instruct · model · 1.2k dl · ♡ 6
- 🤗 OpenGVLab/InternVL3_5-241B-A28B-MPO · model · 25 dl · ♡ 2
- 🤗 OpenGVLab/InternVL3_5-241B-A28B-Pretrained · model · 31 dl · ♡ 1
- 🤗 OpenGVLab/InternVL3_5-241B-A28B-Instruct · model · 54 dl · ♡ 15
- 🤗 OpenGVLab/InternVL3_5-38B-MPO · model · 37 dl · ♡ 2
- 🤗 OpenGVLab/InternVL3_5-38B-Pretrained · model · 31 dl · ♡ 2
