A Glimpse to Compress: Dynamic Visual Token Pruning for Large Vision-Language Models

Quan-Sheng Zeng; Yunheng Li; Qilong Wang; Peng-Tao Jiang; Zuxuan Wu; Ming-Ming Cheng; Qibin Hou

arXiv:2508.01548 · cs.CV · August 5, 2025


TL;DR

This paper introduces GlimpsePrune, a dynamic visual token pruning method for large vision-language models that adapts the number of retained tokens to scene complexity, pruning 92.6% of visual tokens while, on average, retaining baseline performance on free-form VQA tasks.

Contribution

GlimpsePrune is a data-driven, dynamic pruning framework for LVLMs, inspired by human cognition, that prunes irrelevant visual tokens in a single forward pass before answer generation.

Findings

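01. Prunes 92.6% of visual tokens while, on average, fully retaining baseline performance on free-form VQA tasks.

02. GlimpsePrune+, a fine-tuned variant, achieves 110% of baseline performance while maintaining a similarly high pruning rate.

03. Reduces computational cost, which enables more effective fine-tuning.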

Abstract

Visual token compression is critical for Large Vision-Language Models (LVLMs) to efficiently process high-resolution inputs. Existing methods that typically adopt fixed compression ratios cannot adapt to scenes of varying complexity, often causing imprecise pruning that discards informative visual tokens and results in degraded model performance. To address this issue, we introduce a dynamic pruning framework, GlimpsePrune, inspired by human cognition. It takes a data-driven "glimpse" and prunes irrelevant visual tokens in a single forward pass before answer generation. This approach prunes 92.6% of visual tokens while on average fully retaining the baseline performance on free-form VQA tasks. The reduced computational cost also enables more effective fine-tuning: an enhanced GlimpsePrune+ achieves 110% of the baseline performance while maintaining a similarly high pruning rate. Our…
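To make the contrast between fixed-ratio and dynamic pruning concrete, here is a minimal Python sketch. It is not the GlimpsePrune implementation: the per-token relevance scores, the threshold of 0.5, and the function names `prune_fixed_ratio` and `prune_dynamic` are all assumptions chosen for illustration, and the abstract does not say how the method scores or selects tokens.

```python
from typing import List


def prune_fixed_ratio(scores: List[float], keep_ratio: float) -> List[int]:
    """Keep the top-k tokens by score, where k is a fixed fraction of the input."""
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def prune_dynamic(scores: List[float], threshold: float) -> List[int]:
    """Keep every token whose score reaches the threshold (hypothetical rule).

    The number of kept tokens varies with the input instead of being fixed.
    """
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    if not kept:
        # Always keep at least the single highest-scoring token.
        kept = [max(range(len(scores)), key=lambda i: scores[i])]
    return kept


def pruning_rate(num_tokens: int, num_kept: int) -> float:
    """Fraction of tokens discarded."""
    return 1.0 - num_kept / num_tokens


# Made-up relevance scores for two 8-token inputs (illustrative only).
simple_scene = [0.9, 0.05, 0.02, 0.01, 0.03, 0.04, 0.02, 0.01]
complex_scene = [0.8, 0.7, 0.6, 0.05, 0.75, 0.02, 0.65, 0.01]

fixed_simple = prune_fixed_ratio(simple_scene, keep_ratio=0.25)
fixed_complex = prune_fixed_ratio(complex_scene, keep_ratio=0.25)
dynamic_simple = prune_dynamic(simple_scene, threshold=0.5)
dynamic_complex = prune_dynamic(complex_scene, threshold=0.5)

for name, kept, total in [
    ("fixed / simple", fixed_simple, len(simple_scene)),
    ("fixed / complex", fixed_complex, len(complex_scene)),
    ("dynamic / simple", dynamic_simple, len(simple_scene)),
    ("dynamic / complex", dynamic_complex, len(complex_scene)),
]:
    print(f"{name}: kept {kept}, pruning rate {pruning_rate(total, len(kept)):.3f}")
```

In this toy setup the fixed-ratio rule keeps two tokens for both inputs, so it retains a low-scoring token in the simple scene and drops three high-scoring tokens in the complex one. The threshold rule keeps one token in the simple scene and five in the complex one, which is the kind of input-dependent behaviour the abstract attributes to dynamic pruning.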

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning