InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Weiyun Wang; Zhangwei Gao; Lixin Gu; Hengjun Pu; Long Cui; Xingguang Wei; Zhaoyang Liu; Linglin Jing; Shenglong Ye; Jie Shao; Zhaokai Wang; Zhe Chen; Hongjie Zhang; Ganlin Yang; Haomin Wang; Qi Wei; Jinhui Yin; Wenhao Li; Erfei Cui; Guanzhou Chen; Zichen Ding; Changyao Tian; Zhenyu Wu; Jingjing Xie; Zehao Li; Bowen Yang; Yuchen Duan; Xuehui Wang; Zhi Hou; Haoran Hao; Tianyi Zhang; Songze Li; Xiangyu Zhao; Haodong Duan; Nianchen Deng; Bin Fu; Yinan He; Yi Wang; Conghui He; Botian Shi; Junjun He; Yingtong Xiong; Han Lv; Lijun Wu; Wenqi Shao; Kaipeng Zhang; Huipeng Deng; Biqing Qi; Jiaye Ge; Qipeng Guo; Wenwei Zhang; Songyang Zhang; Maosong Cao; Junyao Lin; Kexian Tang; Jianfei Gao; Haian Huang; Yuzhe Gu; Chengqi Lyu; Huanze Tang; Rui Wang; Haijun Lv; Wanli Ouyang; Limin Wang; Min Dou; Xizhou Zhu; Tong Lu; Dahua Lin; Jifeng Dai; Weijie Su; Bowen Zhou; Kai Chen; Yu Qiao; Wenhai Wang; Gen Luo

arXiv:2508.18265·cs.CV·August 28, 2025

TL;DR

InternVL3.5 is an open-source multimodal model family that pairs a two-stage reinforcement learning recipe (Cascade RL) with efficiency-oriented deployment strategies (ViR and DvD), improving reasoning, efficiency, and versatility and reporting state-of-the-art results on multiple benchmarks.

Contribution

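The paper presents three strategies: Cascade Reinforcement Learning (Cascade RL) to enhance reasoning, a Visual Resolution Router (ViR) to optimize efficiency, and Decoupled Vision-Language Deployment (DvD) to balance computational load across GPUs. Together they improve reasoning and efficiency and enable new capabilities in open-source multimodal models.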

Findings

01. Up to 16% improvement in reasoning performance
02. 4.05x faster inference speed
03. State-of-the-art results on multiple benchmarks

Abstract

We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) framework, which enhances reasoning through a two-stage process: offline RL for stable convergence and online RL for refined alignment. This coarse-to-fine training strategy leads to substantial improvements on downstream reasoning tasks, e.g., MMMU and MathVista. To optimize efficiency, we propose a Visual Resolution Router (ViR) that dynamically adjusts the resolution of visual tokens without compromising performance. Coupled with ViR, our Decoupled Vision-Language Deployment (DvD) strategy separates the vision encoder and language model across different GPUs, effectively balancing computational load. These contributions collectively…
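
To make the idea of resolution routing concrete, here is a minimal sketch of per-patch routing under stated assumptions. It is not the paper's ViR: the abstract does not say how the router scores patches or which token budgets it picks, so the pixel-variance score, the threshold, and the two budgets (`HIGH_RES_TOKENS`, `LOW_RES_TOKENS`) are all hypothetical stand-ins chosen for illustration.

```python
from statistics import pvariance

# Hypothetical token budgets per patch; the abstract does not state these values.
HIGH_RES_TOKENS = 256
LOW_RES_TOKENS = 64


def route_patch(pixels, threshold=0.01):
    """Return a token budget for one patch from a stand-in complexity score.

    Pixel variance is an illustrative proxy, not the paper's router.
    """
    score = pvariance(pixels)
    return HIGH_RES_TOKENS if score > threshold else LOW_RES_TOKENS


def route_image(patches, threshold=0.01):
    """Route every patch and report the total visual-token count."""
    budgets = [route_patch(p, threshold) for p in patches]
    return budgets, sum(budgets)


flat_patch = [0.5] * 16        # uniform region, e.g. blank background
busy_patch = [0.0, 1.0] * 8    # high-contrast region, e.g. dense text
budgets, total = route_image([flat_patch, busy_patch, flat_patch])
print(budgets, total)  # [64, 256, 64] 384
```

In this toy setting, the two uniform patches fall back to the smaller budget and only the high-contrast patch keeps the larger one, so the image costs 384 visual tokens instead of 768. That is the kind of saving a router aims for when it lowers resolution only where detail is not needed.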
