Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin

TL;DR
Qwen2-VL introduces dynamic resolution processing and advanced multimodal fusion techniques, significantly improving visual perception and performance across various benchmarks in vision-language models.
Contribution
The paper presents the Naive Dynamic Resolution mechanism and Multimodal Rotary Position Embedding, enabling flexible image processing and effective multimodal fusion in large-scale vision-language models.
Findings
Qwen2-VL-72B achieves performance comparable to GPT-4o and Claude3.5-Sonnet.
Dynamic resolution processing improves visual representation efficiency.
Scaling model size and data enhances multimodal benchmark results.
Abstract
We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language…
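To make the dynamic-resolution idea concrete, the sketch below estimates how many visual tokens an image of a given size would produce. It is an illustration under stated assumptions, not the authors' implementation: the 14-pixel patch size and the 2x2 merging of adjacent patches are recalled from the published paper rather than stated on this page, while the function name visual_token_count and the nearest-multiple rounding rule are hypothetical choices made for this example.

```python
def visual_token_count(height: int, width: int,
                       patch_size: int = 14, merge: int = 2) -> int:
    """Estimate the visual token count for an image of the given pixel size.

    Assumptions (not taken from this page): the vision encoder splits the
    image into patch_size x patch_size patches, and each merge x merge group
    of adjacent patches is compressed into a single token.
    """
    unit = patch_size * merge  # pixels covered by one merged token per side
    # Hypothetical rounding rule: snap each side to the nearest multiple of
    # `unit`, keeping at least one token per side.
    rows = max(1, round(height / unit))
    cols = max(1, round(width / unit))
    return rows * cols


if __name__ == "__main__":
    # Larger images yield more tokens instead of being resized to a fixed grid.
    for h, w in [(224, 224), (448, 224), (1092, 1568)]:
        print(f"{h}x{w} -> {visual_token_count(h, w)} visual tokens")
```

Under these assumptions the token count grows with image area, which is the behavior the abstract contrasts with a predetermined-resolution pipeline. Any special boundary tokens added around the visual sequence are not counted here.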
Code & Models
- 🤗 Qwen/Qwen3-VL-8B-Instruct · model · 4.5M dl · ♡ 847
- 🤗 Qwen/Qwen2.5-VL-7B-Instruct · model · 4.5M dl · ♡ 1480
- 🤗 Qwen/Qwen3-VL-8B-Thinking · model · 97k dl · ♡ 199
- 🤗 Qwen/Qwen2.5-VL-3B-Instruct · model · 6.6M dl · ♡ 631
- 🤗 Qwen/Qwen2.5-VL-72B-Instruct · model · 103k dl · ♡ 604
- 🤗 Qwen/Qwen3-VL-30B-A3B-Instruct · model · 4.8M dl · ♡ 555
- 🤗 Qwen/Qwen3-VL-4B-Instruct · model · 2.0M dl · ♡ 363
- 🤗 Qwen/Qwen3-VL-32B-Instruct · model · 1.4M dl · ♡ 193
- 🤗 Qwen/Qwen3-VL-8B-Instruct-GGUF · model · 43k dl · ♡ 76
- 🤗 unsloth/Qwen2.5-VL-7B-Instruct-GGUF · model · 69k dl · ♡ 151
Taxonomy
Topics: Categorization, perception, and language
