Introducing Visual Perception Token into Multimodal Large Language Model
Runpeng Yu, Xinyin Ma, Xinchao Wang

TL;DR
This paper introduces Visual Perception Tokens that enable multimodal large language models to autonomously control and refine their visual perception, leading to significant improvements in spatial reasoning and fine-grained understanding tasks.
Contribution
It proposes a novel mechanism of Visual Perception Tokens, including Region Selection and Vision Re-Encoding tokens, allowing MLLMs to autonomously manage visual perception processes.
Findings
Improved spatial reasoning and fine-grained understanding performance.
Enhanced model accuracy by 23.6% on average with Visual Perception Tokens.
Outperforms larger models without additional parameters.
Abstract
To utilize visual information, a Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. The completeness and accuracy of visual perception significantly influence the precision of spatial reasoning, fine-grained understanding, and other tasks. However, MLLMs still lack the autonomous capability to control their own visual perception processes, for example, selectively reviewing specific regions of an image or focusing on information related to specific object categories. In this work, we propose the concept of the Visual Perception Token, aiming to empower MLLMs with a mechanism to control their visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs autonomously generate these tokens, just as they generate text, and use them to trigger additional visual…
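The control flow described in the abstract, where the model emits a special token during ordinary decoding and that token triggers another look at the image, can be pictured with a toy loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the marker strings `<region_selection>` and `<vision_re_encoding>`, the scripted `toy_model` stub, and the `perceive_again` helper are hypothetical names invented for illustration, and the abstract as shown here does not say what the triggered perception step actually computes.

```python
from typing import Callable, List

# Hypothetical marker strings; the real token vocabulary is not given in this summary.
REGION_SELECTION = "<region_selection>"
VISION_RE_ENCODING = "<vision_re_encoding>"
PERCEPTION_TOKENS = {REGION_SELECTION, VISION_RE_ENCODING}


def perceive_again(token: str, image: str) -> str:
    """Stand-in for an extra perception pass triggered by a token (assumed behavior)."""
    if token == REGION_SELECTION:
        return f"[crop of {image}]"
    return f"[re-encoded {image}]"


def generate_with_perception(
    model: Callable[[List[str]], str],
    prompt: List[str],
    image: str,
    max_steps: int = 16,
) -> List[str]:
    """Decode token by token; when a perception token appears, append new visual context."""
    context = list(prompt)
    for _ in range(max_steps):
        token = model(context)
        context.append(token)
        if token in PERCEPTION_TOKENS:
            # The emitted token acts as a control signal, not as answer text.
            context.append(perceive_again(token, image))
        elif token == "<eos>":
            break
    return context[len(prompt):]


def toy_model(context: List[str]) -> str:
    """Scripted stub: ask for a region once, then answer and stop."""
    if REGION_SELECTION not in context:
        return REGION_SELECTION
    if "answer" not in context:
        return "answer"
    return "<eos>"


output = generate_with_perception(toy_model, ["question"], "image.png")
print(output)
```

The point of the sketch is the design choice the abstract highlights: the decision to look again is made by the model itself during generation, rather than by an external, fixed preprocessing step.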
Code & Models
- 🤗 rp-yu/Qwen2-VL-7b-VPT-CLIP (model): 140 downloads, 1 like
- 🤗 rp-yu/Qwen2-VL-2b-VPT-Seg (model): 8 downloads, 1 like
- 🤗 rp-yu/Qwen2-VL-2b-VPT-CLIP (model): 8 downloads, 1 like
- 🤗 rp-yu/Qwen2-VL-2b-VPT-Det (model): 2 downloads
- 🤗 rp-yu/Qwen2-VL-2b-VPT-Det-NoPrompt (model): 4 downloads
- 🤗 rp-yu/Qwen2-VL-2b-VPT-Seg-Alignment (model): 3 downloads
- 🤗 rp-yu/Qwen2-VL-2b-VPT-Det-Alignment (model): 4 downloads
Taxonomy
Topics: Multimodal Machine Learning Applications
