Qwen2.5-VL Technical Report

Shuai Bai; Keqin Chen; Xuejing Liu; Jialin Wang; Wenbin Ge; Sibo Song; Kai Dang; Peng Wang; Shijie Wang; Jun Tang; Humen Zhong; Yuanzhi Zhu; Mingkun Yang; Zhaohai Li; Jianqiang Wan; Pengfei Wang; Wei Ding; Zheren Fu; Yiheng Xu; Jiabo Ye; Xi Zhang; Tianbao Xie; Zesen Cheng; Hang Zhang; Zhibo Yang; Haiyang Xu; Junyang Lin (additional authors not shown)

arXiv:2502.13923·cs.CV·March 5, 2025·51 cites



TL;DR

Qwen2.5-VL is a new vision-language model that advances visual recognition, document parsing, and video understanding, enabling detailed spatial and temporal analysis for real-world applications.

Contribution

It introduces a native dynamic-resolution ViT with Window Attention, enhancing multi-modal understanding and processing of complex visual and document data.
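A minimal sketch of how window attention can be laid over a dynamic-resolution patch grid. The patch size (14), the window size (8 x 8 patches), and the function names `patch_grid`, `window_ids`, and `window_attention_mask` are illustrative assumptions, not taken from this page, and the code is not the report's implementation.

```python
def patch_grid(height, width, patch=14):
    """Patch rows and columns for an image kept at its native size.

    Assumes height and width are already multiples of `patch` (hypothetical default).
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of the patch size")
    return height // patch, width // patch


def window_ids(rows, cols, window=8):
    """Give each patch (row-major order) the id of the window that contains it."""
    windows_per_row = -(-cols // window)  # ceiling division
    return [
        (r // window) * windows_per_row + (c // window)
        for r in range(rows)
        for c in range(cols)
    ]


def window_attention_mask(ids):
    """mask[i][j] is True when patch i may attend to patch j (same window)."""
    return [[a == b for b in ids] for a in ids]


# A 224 x 336 image yields a 16 x 24 patch grid split into 6 windows.
rows, cols = patch_grid(224, 336)
ids = window_ids(rows, cols)
mask = window_attention_mask(ids)

# A smaller image yields a smaller grid; windows at the edge hold fewer patches.
small_ids = window_ids(*patch_grid(140, 70))
```

Under these assumptions, each patch attends to at most 64 others instead of the whole grid, so the attention cost grows roughly linearly with the number of patches rather than quadratically.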

Findings

01 Achieves state-of-the-art performance in document and diagram understanding.

02 Handles videos up to hours long with second-level event localization.

03 Maintains strong linguistic capabilities alongside visual functionalities.

Abstract

We introduce Qwen2.5-VL, the latest flagship model of the Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to accurately localize objects using bounding boxes or points. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This…
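One way to read "absolute time encoding" is that a frame's temporal position follows its timestamp rather than its index in the sampled sequence. The sketch below illustrates that idea only; the function name `temporal_ids` and the `ids_per_second` granularity are hypothetical, and the mechanism is an assumption rather than the report's implementation.

```python
def temporal_ids(num_frames, fps, ids_per_second=2):
    """Temporal position id for each sampled frame, tied to wall-clock time.

    Frame k is shown at k / fps seconds. Its id is that timestamp scaled by
    `ids_per_second` (a hypothetical granularity) and truncated to an integer.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [int(k / fps * ids_per_second) for k in range(num_frames)]


def id_to_seconds(position_id, ids_per_second=2):
    """Recover the timestamp (in seconds) that a temporal id stands for."""
    return position_id / ids_per_second


# The same 4-second clip sampled at two different frame rates.
slow = temporal_ids(num_frames=4, fps=1)   # frames at 0 s, 1 s, 2 s, 3 s
fast = temporal_ids(num_frames=8, fps=2)   # frames at 0 s, 0.5 s, ..., 3.5 s
```

With index-based positions, the frame at 3 seconds would be position 3 in the slow sampling and position 6 in the fast one. Here both samplings give it the same id, so an id maps back to a time in seconds regardless of the frame rate.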


Taxonomy

Topics: Semiconductor Lasers and Optical Devices

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax