VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning

Senqiao Yang; Junyi Li; Xin Lai; Bei Yu; Hengshuang Zhao; Jiaya Jia

arXiv:2507.13348 · cs.CV · July 18, 2025

TL;DR

VisionThink introduces a dynamic visual token compression method for vision-language models, using reinforcement learning to adaptively decide image resolution, improving efficiency and accuracy across diverse tasks.

Contribution

It proposes a novel adaptive token compression paradigm with reinforcement learning, enabling models to selectively process images at different resolutions based on task complexity.

Findings

01. Achieves strong performance on OCR-related tasks with reduced tokens.

02. Saves computational resources on simpler tasks.

03. Demonstrates superior efficiency and effectiveness over existing methods.
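
To make the savings on simpler tasks concrete, here is a rough cost sketch. It is not taken from the paper: it assumes the downsampled image costs a fixed fraction of the full-resolution visual tokens, and that a sample which requests the higher-resolution image pays for both passes. The function name and all numbers are hypothetical.

```python
def expected_visual_tokens(full_tokens: int,
                           low_fraction: float,
                           request_rate: float) -> float:
    """Average visual tokens per sample under the stated assumptions.

    full_tokens  -- visual tokens for the full-resolution image (illustrative)
    low_fraction -- low-res token count as a fraction of full_tokens (assumed)
    request_rate -- share of samples that ask for the high-res image (assumed)
    """
    if not 0.0 <= request_rate <= 1.0:
        raise ValueError("request_rate must be in [0, 1]")
    low_tokens = full_tokens * low_fraction
    # Every sample pays for the low-res pass; only some pay for the high-res one.
    return low_tokens + request_rate * full_tokens


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    print(expected_visual_tokens(1000, 0.25, 0.2))
```

Under these assumptions the adaptive scheme costs less than always using the full image whenever the request rate stays below 1 - low_fraction.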

Abstract

Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed…
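
The control flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token string `<request_high_res>`, the `model` callable, and the toy stand-in model are all hypothetical names introduced here, and the reinforcement learning that teaches the model when to emit the request is not shown.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical special token; the abstract does not give its actual string.
REQUEST_HIGH_RES = "<request_high_res>"


@dataclass
class Answer:
    text: str
    used_high_res: bool


def answer_with_adaptive_resolution(
    question: str,
    low_res_image: Any,
    high_res_image: Any,
    model: Callable[[str, Any], str],
) -> Answer:
    """Try the downsampled image first; escalate only if the model asks."""
    first = model(question, low_res_image)
    if REQUEST_HIGH_RES not in first:
        # The low-res image was judged sufficient, so no extra tokens are spent.
        return Answer(text=first, used_high_res=False)
    # The model emitted the request token: rerun with the high-res image.
    second = model(question, high_res_image)
    return Answer(text=second, used_high_res=True)


def toy_model(question: str, image: Any) -> str:
    """Stand-in for a VLM: asks for high-res only on 'read' questions."""
    if "read" in question.lower() and image == "low":
        return REQUEST_HIGH_RES
    return f"answer from {image}-res image"


if __name__ == "__main__":
    print(answer_with_adaptive_resolution("What color is the car?", "low", "high", toy_model))
    print(answer_with_adaptive_resolution("Read the sign.", "low", "high", toy_model))
```

In this sketch the decision to escalate is made per sample by the model's own output, which is the contrast the abstract draws with methods that compress tokens in a fixed way.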

Videos

VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (SlidesLive)

Taxonomy

Topics: Multimodal Machine Learning Applications