VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

Ce Zhang; Kaixin Ma; Tianqing Fang; Wenhao Yu; Hongming Zhang; Zhisong Zhang; Haitao Mi; Dong Yu

arXiv:2505.22654 · cs.CV · February 2, 2026

TL;DR

VScan is a two-stage visual token reduction method that accelerates large vision-language models by reducing visual tokens during both visual encoding and language decoding, retaining most of the original performance at a lower computational cost.

Contribution

The paper proposes VScan, a two-stage framework for visual token reduction. During visual encoding it combines complementary global and local scans with token merging, improving efficiency while retaining most of the original accuracy.

Findings

01. VScan achieves 2.91× speedup in pre-filling for LLaVA-NeXT-7B.

02. VScan reduces FLOPs by 10× while retaining 95.4% of original performance.

03. Extensive experiments validate VScan's effectiveness across multiple benchmarks.

Abstract

Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding,…
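The idea of pruning unimportant visual tokens, which the abstract attributes to prior studies, can be illustrated with a generic sketch. This is not VScan's algorithm: the function name `prune_tokens`, the `keep_ratio` parameter, and the toy importance scores are all hypothetical, and the example assumes only that each token already has a scalar importance score (for instance, one derived from attention weights).

```python
from typing import List, Sequence


def prune_tokens(tokens: Sequence[Sequence[float]],
                 scores: Sequence[float],
                 keep_ratio: float) -> List[int]:
    """Return the indices of the tokens to keep, in their original order.

    Hypothetical helper for illustration; not taken from the VScan paper.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Always keep at least one token.
    k = max(1, int(len(tokens) * keep_ratio))
    # Rank indices by descending score; sorted() is stable, so ties keep
    # the lower index first.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Restore positional order so the kept tokens stay in sequence.
    return sorted(ranked[:k])


# Toy data: four 2-dimensional "visual tokens" with assumed importance scores.
tokens = [[0.1, 0.2], [0.9, 0.4], [0.3, 0.3], [0.7, 0.8]]
scores = [0.05, 0.60, 0.10, 0.25]

kept = prune_tokens(tokens, scores, keep_ratio=0.5)
pruned_tokens = [tokens[i] for i in kept]
print(kept)  # [1, 3]
```

Keeping the surviving indices in positional order is a design choice made here so that any position information tied to the tokens remains meaningful after pruning.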

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications

Methods: Pruning