GreedyPrune: Retenting Critical Visual Token Set for Large Vision Language Models

Ruiguang Pei; Weiqing Sun; Zhihui Fu; Jun Wang

arXiv:2506.13166·cs.CV·June 17, 2025

Open Access

TL;DR

GreedyPrune is a training-free visual token pruning method for large vision language models that balances semantic importance and visual diversity, reducing computation while maintaining high accuracy.

Contribution

The paper introduces a greedy algorithm that jointly optimizes semantic saliency and visual diversity when pruning visual tokens, with the aim of improving both efficiency and accuracy.

Findings

01. Achieves state-of-the-art accuracy on multiple multimodal tasks

02. Reduces inference latency significantly

03. Maintains high semantic integrity with aggressive pruning

Abstract

Although Large Vision Language Models (LVLMs) have demonstrated remarkable performance in image understanding tasks, their computational efficiency remains a significant challenge, particularly on resource-constrained devices due to the high cost of processing large numbers of visual tokens. Recently, training-free visual token pruning methods have gained popularity as a low-cost solution to this issue. However, existing approaches suffer from two key limitations: semantic saliency-based strategies primarily focus on high cross-attention visual tokens, often neglecting visual diversity, whereas visual diversity-based methods risk inadvertently discarding semantically important tokens, especially under high compression ratios. In this paper, we introduce GreedyPrune, a training-free plug-and-play visual token pruning algorithm designed to jointly optimize semantic saliency and visual…
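
The abstract is cut off before it describes the algorithm itself, so the following is only an illustrative sketch of the general idea it names: greedily keeping tokens that are semantically salient while penalizing redundancy with tokens already kept. The scoring rule (saliency minus maximum cosine similarity to the kept set), the function names `cosine` and `greedy_select`, and the `trade_off` parameter are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def greedy_select(features, saliency, keep, trade_off=0.5):
    """Hypothetical greedy token selection (an assumption, not the paper's rule).

    features:  one feature vector per visual token
    saliency:  one importance score per token (e.g. an attention weight)
    keep:      number of tokens to retain
    trade_off: 1.0 ranks by saliency only; lower values weight diversity more
    Returns the indices of the retained tokens in the order they were picked.
    """
    if len(features) != len(saliency):
        raise ValueError("features and saliency must have the same length")
    if not 0 <= keep <= len(features):
        raise ValueError("keep must be between 0 and the number of tokens")

    selected = []
    remaining = set(range(len(features)))
    # max_sim[i]: highest similarity between token i and any kept token,
    # floored at 0.0 so dissimilar tokens are never rewarded beyond saliency.
    max_sim = [0.0] * len(features)

    while len(selected) < keep:
        # Iterating in sorted order makes ties resolve to the lowest index.
        best = max(
            sorted(remaining),
            key=lambda i: trade_off * saliency[i] - (1.0 - trade_off) * max_sim[i],
        )
        selected.append(best)
        remaining.remove(best)
        # Update each candidate's redundancy with respect to the new pick.
        for i in remaining:
            max_sim[i] = max(max_sim[i], cosine(features[i], features[best]))
    return selected
```

With `trade_off=1.0` the sketch reduces to plain top-k by saliency, which is the behavior the abstract criticizes for neglecting diversity; lowering `trade_off` makes near-duplicate tokens progressively less likely to be kept together.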

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques

Methods: Pruning · Focus