Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou

TL;DR
The Qwen-VL series consists of large-scale vision-language models that handle image understanding, localization, text reading, and more. They achieve state-of-the-art results among generalist models of similar scale across a range of visual tasks and outperform existing vision-language chatbots on real-world dialog benchmarks.
Contribution
Introduces the Qwen-VL models, built on Qwen-LM with a newly designed visual receptor, input-output interface, 3-stage training pipeline, and multilingual multimodal cleaned corpus, advancing generalist vision-language understanding and interaction.
Findings
Set new records for generalist models of similar scale on visual-centric benchmarks (e.g., image captioning, question answering, visual grounding)
Excel in zero-shot and few-shot settings
Outperform existing vision-language chatbots on real-world dialog benchmarks
Abstract
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our…
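The abstract mentions aligning image-caption-box tuples for grounding. As a rough illustration only, the sketch below shows one way such a tuple could be turned into plain text a language model can read: pixel coordinates are rescaled to integers in [0, 1000) and wrapped in marker strings. The function name, the `<ref>`/`<box>` markers, and the bin count are assumptions made for this example and are not specified in the text above.

```python
from typing import Tuple


def serialize_grounded_phrase(
    phrase: str,
    box: Tuple[float, float, float, float],
    image_size: Tuple[int, int],
    bins: int = 1000,  # assumed resolution of the coordinate grid
) -> str:
    """Render a (phrase, box) pair as text with size-independent coordinates.

    `box` is (x1, y1, x2, y2) in pixels; `image_size` is (width, height).
    The marker strings are illustrative placeholders, not a confirmed format.
    """
    x1, y1, x2, y2 = box
    width, height = image_size

    def scale(value: float, extent: int) -> int:
        # Map a pixel coordinate to an integer bin and clamp it to [0, bins - 1].
        return min(bins - 1, max(0, int(value * bins // extent)))

    coords = (
        f"({scale(x1, width)},{scale(y1, height)}),"
        f"({scale(x2, width)},{scale(y2, height)})"
    )
    return f"<ref>{phrase}</ref><box>{coords}</box>"


if __name__ == "__main__":
    # A 640x480 image with a box around "a dog".
    print(serialize_grounded_phrase("a dog", (64, 48, 320, 240), (640, 480)))
    # prints: <ref>a dog</ref><box>(100,100),(500,500)</box>
```

Rescaling to a fixed grid keeps the text representation independent of the input image resolution, which is the main reason to prefer it over raw pixel values in a sketch like this.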
Code & Models
- 🤗 Qwen/Qwen3-VL-8B-Instruct · model · 4.5M downloads · ♡ 847
- 🤗 Qwen/Qwen2.5-VL-7B-Instruct · model · 4.5M downloads · ♡ 1480
- 🤗 Qwen/Qwen3-VL-8B-Thinking · model · 97k downloads · ♡ 199
- 🤗 Qwen/Qwen2.5-VL-3B-Instruct · model · 6.6M downloads · ♡ 631
- 🤗 Qwen/Qwen2.5-VL-72B-Instruct · model · 103k downloads · ♡ 604
- 🤗 Qwen/Qwen3-VL-30B-A3B-Instruct · model · 4.8M downloads · ♡ 555
- 🤗 Qwen/Qwen3-VL-4B-Instruct · model · 2.0M downloads · ♡ 363
- 🤗 Qwen/Qwen3-VL-32B-Instruct · model · 1.4M downloads · ♡ 193
- 🤗 Qwen/Qwen3-VL-8B-Instruct-GGUF · model · 43k downloads · ♡ 76
- 🤗 unsloth/Qwen2.5-VL-7B-Instruct-GGUF · model · 69k downloads · ♡ 151
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
