Qwen2.5-VL Technical Report
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng

TL;DR
Qwen2.5-VL is a new vision-language model that advances visual recognition, document parsing, and video understanding, enabling detailed spatial and temporal analysis for real-world applications.
Contribution
It introduces a native dynamic-resolution ViT with Window Attention, enhancing multi-modal understanding and processing of complex visual and document data.
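As a rough illustration of why window attention pairs well with native dynamic resolution, the sketch below counts query-key pairs for full attention versus windowed attention on an image kept at its own size. The patch size of 14 pixels, the window of 8 x 8 patches, and all function names are assumptions chosen for illustration; they are not taken from the summary above and this is not the model's actual implementation.

```python
import math


def patch_grid(height, width, patch=14):
    """Patch rows and columns for an image at native resolution.

    Assumption: partial patches at the border are rounded up to a whole patch.
    """
    return math.ceil(height / patch), math.ceil(width / patch)


def window_count(rows, cols, window=8):
    """Number of non-overlapping windows needed to cover the patch grid."""
    return math.ceil(rows / window) * math.ceil(cols / window)


def attention_pairs(rows, cols, window=8):
    """Query-key pair counts for full attention vs. windowed attention.

    Full attention compares every patch with every patch. Windowed
    attention only compares patches that share a window, so its cost
    grows roughly linearly with the number of patches.
    """
    n = rows * cols
    full = n * n
    windowed = 0
    for r0 in range(0, rows, window):
        for c0 in range(0, cols, window):
            h = min(window, rows - r0)  # border windows may be smaller
            w = min(window, cols - c0)
            windowed += (h * w) ** 2
    return full, windowed


rows, cols = patch_grid(448, 672)
full, windowed = attention_pairs(rows, cols)
print(rows, cols, window_count(rows, cols), full // windowed)
```

Under these assumed sizes, a larger input adds more windows rather than enlarging each one, which is the property that keeps attention cost manageable as resolution varies.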
Findings
Achieves state-of-the-art performance in document and diagram understanding.
Handles videos up to hours long with second-level event localization.
Maintains strong linguistic capabilities alongside visual functionalities.
Abstract
We introduce Qwen2.5-VL, the latest flagship model of the Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to accurately localize objects using bounding boxes or points. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This…
Code & Models
- 🤗 Qwen/Qwen3-VL-8B-Instruct · model · 4.5M dl · ♡ 847
- 🤗 Qwen/Qwen3-VL-8B-Thinking · model · 97k dl · ♡ 199
- 🤗 Qwen/Qwen3-VL-30B-A3B-Instruct · model · 4.8M dl · ♡ 555
- 🤗 Qwen/Qwen3-VL-4B-Instruct · model · 2.0M dl · ♡ 363
- 🤗 Qwen/Qwen3-VL-32B-Instruct · model · 1.4M dl · ♡ 193
- 🤗 Qwen/Qwen3-VL-8B-Instruct-GGUF · model · 43k dl · ♡ 76
- 🤗 Qwen/Qwen3-VL-8B-Instruct-FP8 · model · 510k dl · ♡ 66
- 🤗 Qwen/Qwen3-VL-2B-Instruct · model · 2.4M dl · ♡ 356
- 🤗 unsloth/Qwen3-VL-30B-A3B-Instruct-GGUF · model · 57k dl · ♡ 90
- 🤗 Qwen/Qwen3-VL-235B-A22B-Thinking · model · 350k dl · ♡ 385
Taxonomy
Topics: Semiconductor Lasers and Optical Devices
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax
