Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang; Shuai Bai; Sinan Tan; Shijie Wang; Zhihao Fan; Jinze Bai; Keqin Chen; Xuejing Liu; Jialin Wang; Wenbin Ge; Yang Fan; Kai Dang; Mengfei Du; Xuancheng Ren; Rui Men; Dayiheng Liu; Chang Zhou; Jingren Zhou; Junyang Lin

arXiv:2409.12191 · cs.CV · October 4, 2024 · 73 citations

Open Access · 5 Repos · 10 Models · 1 Dataset

TL;DR

Qwen2-VL introduces dynamic-resolution image processing and Multimodal Rotary Position Embedding (M-RoPE) for fusing positional information across text, images, and videos, improving visual perception and benchmark performance in vision-language models.

Contribution

The paper presents the Naive Dynamic Resolution mechanism and Multimodal Rotary Position Embedding, enabling flexible image processing and effective multimodal fusion in large-scale vision-language models.
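As a rough illustration of how a multimodal position embedding can treat text and image tokens differently, the sketch below assigns each token a (temporal, height, width) position triple. The decomposition into three components, the rule that text tokens reuse one ID for all three, and the rule that the next segment starts after the largest ID used so far are assumptions not stated in this summary; the function name `mrope_position_ids` and the segment format are hypothetical.

```python
def mrope_position_ids(segments):
    """Build (temporal, height, width) position triples for a mixed sequence.

    `segments` is a list of ("text", n_tokens) or ("image", grid_h, grid_w).
    This is an illustrative sketch under stated assumptions, not the
    authors' implementation.
    """
    ids = []
    next_pos = 0  # first free position ID for the next segment
    for seg in segments:
        kind = seg[0]
        if kind == "text":
            n_tokens = seg[1]
            for i in range(n_tokens):
                p = next_pos + i
                # Text: all three components share the same ID,
                # which reduces to ordinary 1D positions.
                ids.append((p, p, p))
            next_pos += n_tokens
        elif kind == "image":
            _, grid_h, grid_w = seg
            for row in range(grid_h):
                for col in range(grid_w):
                    # Image: temporal ID is constant, while height and
                    # width IDs follow the token's row and column.
                    ids.append((next_pos, next_pos + row, next_pos + col))
            # Continue after the largest ID used by the image.
            next_pos += max(grid_h, grid_w)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return ids


example = mrope_position_ids([("text", 2), ("image", 2, 2), ("text", 1)])
```

In this example the two text tokens get IDs 0 and 1, the four image tokens share temporal ID 2 while their height and width IDs range over 2 and 3, and the trailing text token resumes at 4.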

Findings

01. Qwen2-VL-72B achieves performance comparable to GPT-4o and Claude3.5-Sonnet.

02. Dynamic resolution processing improves visual representation efficiency.

03. Scaling model size and data enhances multimodal benchmark results.

Abstract

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language…
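To make the idea of "different numbers of visual tokens" concrete, here is a minimal sketch of how a token count could scale with image size. It assumes a patch size of 14 pixels and a 2 × 2 merge of adjacent patch tokens; neither value is given in this summary, and the function name `visual_token_count` and the rounding rule are hypothetical.

```python
def visual_token_count(height, width, patch_size=14, merge_size=2):
    """Estimate how many visual tokens an image yields.

    Assumptions (not from the summary above): the image is resized so each
    side is a multiple of patch_size * merge_size, split into square
    patches, and every merge_size x merge_size group of patches is merged
    into one token.
    """
    unit = patch_size * merge_size
    # Snap each side to the nearest multiple of `unit`, at least one unit.
    snapped_h = max(unit, round(height / unit) * unit)
    snapped_w = max(unit, round(width / unit) * unit)
    # Number of patches along each side.
    grid_h = snapped_h // patch_size
    grid_w = snapped_w // patch_size
    # Each merge_size x merge_size block of patches becomes one token.
    return (grid_h * grid_w) // (merge_size ** 2)


small = visual_token_count(224, 224)
wide = visual_token_count(224, 448)
```

Under these assumptions a 224 × 224 image maps to 64 tokens and a 224 × 448 image to 128, so the token budget grows with resolution instead of being fixed in advance.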

Code & Models

Datasets

mehti/LMOD-Cataract-1K-surgical-analysis-cot
dataset · 2.1k downloads

Taxonomy

Topics: Categorization, perception, and language