Introducing Visual Perception Token into Multimodal Large Language Model

Runpeng Yu; Xinyin Ma; Xinchao Wang

arXiv:2502.17425·cs.CV·February 25, 2025

TL;DR

This paper introduces Visual Perception Tokens that enable multimodal large language models to autonomously control and refine their visual perception, leading to significant improvements in spatial reasoning and fine-grained understanding tasks.

Contribution

The paper proposes Visual Perception Tokens, a mechanism comprising the Region Selection Token and the Vision Re-Encoding Token, that lets MLLMs control their own visual perception processes.

Findings

01. Improves performance on spatial reasoning and fine-grained understanding tasks.

02. Raises model accuracy by 23.6% on average when Visual Perception Tokens are added.

03. Outperforms larger models without additional parameters.

Abstract

To utilize visual information, a Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. The completeness and accuracy of visual perception significantly influence the precision of spatial reasoning, fine-grained understanding, and other tasks. However, MLLMs still lack the autonomous capability to control their own visual perception processes, for example, selectively reviewing specific regions of an image or focusing on information related to specific object categories. In this work, we propose the concept of the Visual Perception Token, aiming to empower MLLMs with a mechanism to control their visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs autonomously generate these tokens, just as they generate text, and use them to trigger additional visual…
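The idea of tokens that trigger extra perception passes can be pictured with a toy control loop. The sketch below is an illustration only, not the authors' implementation: the token strings `<region>` and `<reencode>`, the box format, and the helpers `crop`, `toy_encode`, and `run_perception_loop` are all hypothetical names, and the "encoder" simply flattens pixel values.

```python
from typing import List, Tuple, Union

Image = List[List[int]]            # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]    # hypothetical box format: (top, left, bottom, right)

# Hypothetical token names, chosen for illustration only.
REGION_TOKEN = "<region>"      # stand-in for a Region Selection Token
REENCODE_TOKEN = "<reencode>"  # stand-in for a Vision Re-Encoding Token

Token = Union[str, Tuple[str, Box]]


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by the box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def toy_encode(image: Image) -> List[int]:
    """Stand-in for a vision encoder: flattens pixels into a feature list."""
    return [pixel for row in image for pixel in row]


def run_perception_loop(image: Image, generated: List[Token]) -> List[List[int]]:
    """Walk the generated tokens; each perception token adds one encoding pass."""
    context = [toy_encode(image)]  # initial perception pass over the full image
    for token in generated:
        if isinstance(token, tuple) and token[0] == REGION_TOKEN:
            # Region selection: look again at the chosen part of the image.
            context.append(toy_encode(crop(image, token[1])))
        elif token == REENCODE_TOKEN:
            # Re-encoding: run the (toy) encoder over the whole image again.
            context.append(toy_encode(image))
        # Ordinary text tokens trigger no extra perception.
    return context


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
generated = ["The", (REGION_TOKEN, (0, 1, 2, 3)), "answer", REENCODE_TOKEN]
context = run_perception_loop(image, generated)
```

In this toy run the context ends up with three feature lists: the initial pass, the cropped region, and the repeated full-image pass. How the actual tokens encode a region or condition the re-encoding step is not specified in the excerpt above, so those details are left out here.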

Code & Models

Repositories

yu-rp/visualperceptiontoken (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications