Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai; Shuai Bai; Shusheng Yang; Shijie Wang; Sinan Tan; Peng Wang; Junyang Lin; Chang Zhou; Jingren Zhou

arXiv:2308.12966·cs.CV·October 16, 2023·136 cites

TL;DR

The Qwen-VL series is a family of large-scale vision-language models that handle image understanding, localization (grounding), text reading, and more. The models set new records among generalist models of similar scale on a range of visual-centric benchmarks and outperform existing vision-language chatbots on real-world dialog benchmarks.

Contribution

Introduces the Qwen-VL models, built on Qwen-LM with a visual receptor, an input-output interface, a 3-stage training pipeline, and a cleaned multilingual multimodal corpus, advancing generalist vision-language understanding and interaction.

Findings

01. Set new records on visual-centric benchmarks

02. Excel in zero-shot and few-shot learning scenarios

03. Outperform existing vision-language chatbots in real-world benchmarks

Abstract

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our…

Code & Models

Datasets

mehti/LMOD-Cataract-1K-surgical-analysis-cot
dataset · 2.1k downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques