Ovis2.5 Technical Report

Shiyin Lu; Yang Li; Yu Xia; Yuwei Hu; Shanshan Zhao; Yanqing Ma; Zhichao Wei; Yinglun Li; Lunhao Duan; Jianshan Zhao; Yuxuan Han; Haijun Li; Wanying Chen; Junke Tang; Chengkun Hou; Zhixing Du; Tianli Zhou; Wenjie Zhang; Huping Ding; Jiahe Li; Wen Li; Gui Hu; Yiliang Gu; Siran Yang; Jiamang Wang; Hailong Sun; Yibo Wang; Hui Sun; Jinlong Huang; Yuping He; Shengze Shi; Weihong Zhang; Guodong Zheng; Junpeng Jiang; Sensen Gao; Yi-Feng Wu; Sijia Chen; Yuhui Chen; Qing-Guo Chen; Zhao Xu; Weihua Luo; Kaifu Zhang

arXiv:2508.11737 · cs.CV · August 19, 2025


Open Access · 10 Models

TL;DR

Ovis2.5 is a vision-language model with native-resolution perception and reflection-based reasoning. It reports state-of-the-art results among open-source models and is strongest on visually dense tasks such as complex charts.

Contribution

The paper introduces Ovis2.5, which combines native-resolution vision processing, reflection-based reasoning exposed as an optional thinking mode, and a five-phase training curriculum. The released open-source models outperform previous open models.

Findings

01. Ovis2.5-9B scores 78.3 on the OpenCompass leaderboard.

02. Ovis2.5-2B scores 73.9, state of the art for its size.

03. Ovis2.5 performs strongly on STEM, grounding, video, and complex chart tasks.

Abstract

We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection -- including self-checking and revision. This advanced capability is exposed as an optional "thinking mode" at inference time, allowing users to trade latency for enhanced accuracy on difficult inputs. The model is trained via a comprehensive five-phase curriculum that progressively builds its skills. The process begins with foundational visual and multimodal pretraining, advances…
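The contrast the abstract draws between native-resolution processing and fixed-resolution tiling can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the Ovis2.5 code: the patch size of 14 pixels, the tile size of 448 pixels, and the function names are all assumptions chosen for the example. It compares the patch grid obtained when an image keeps its own resolution with the number of fixed square tiles needed to cover the same image, and estimates how much of the tiled area falls outside the image.

```python
import math


def native_patch_grid(width, height, patch=14):
    """Patch grid (rows, cols) when the image keeps its own resolution.

    The patch size of 14 is an assumed value for illustration.
    """
    return math.ceil(height / patch), math.ceil(width / patch)


def fixed_tile_grid(width, height, tile=448):
    """Tile grid (rows, cols) needed to cover the image with square tiles.

    The tile size of 448 is an assumed value for illustration.
    """
    return math.ceil(height / tile), math.ceil(width / tile)


def tiled_area_overhead(width, height, tile=448):
    """Fraction of the tiled area that lies outside the original image."""
    rows, cols = fixed_tile_grid(width, height, tile)
    covered = rows * cols * tile * tile
    return 1 - (width * height) / covered


# A wide, chart-like image (hypothetical dimensions).
w, h = 1600, 500
print(native_patch_grid(w, h))    # one contiguous grid over the whole image
print(fixed_tile_grid(w, h))      # the image is cut at every tile boundary
print(round(tiled_area_overhead(w, h), 3))
```

For this hypothetical 1600 x 500 image, the native grid is a single 36 x 115 layout, while tiling splits the image into 2 x 4 pieces and roughly half of the tiled area is padding or must be filled by rescaling. Real tiling pipelines differ in how they resize and pad, so the numbers only indicate the kind of overhead and fragmentation the abstract refers to.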


Taxonomy

Topics: Engineering, Applied Research