GreedyPrune: Retenting Critical Visual Token Set for Large Vision Language Models
Ruiguang Pei, Weiqing Sun, Zhihui Fu, Jun Wang

TL;DR
GreedyPrune is a training-free visual token pruning method for large vision language models that balances semantic importance and visual diversity, reducing computation while maintaining high accuracy.
Contribution
GreedyPrune introduces a greedy algorithm that jointly optimizes semantic saliency and visual diversity when pruning visual tokens, improving both efficiency and accuracy.
Findings
Achieves state-of-the-art accuracy on multiple multimodal tasks
Reduces inference latency significantly
Maintains high semantic integrity with aggressive pruning
Abstract
Although Large Vision Language Models (LVLMs) have demonstrated remarkable performance in image understanding tasks, their computational efficiency remains a significant challenge, particularly on resource-constrained devices due to the high cost of processing large numbers of visual tokens. Recently, training-free visual token pruning methods have gained popularity as a low-cost solution to this issue. However, existing approaches suffer from two key limitations: semantic saliency-based strategies primarily focus on high cross-attention visual tokens, often neglecting visual diversity, whereas visual diversity-based methods risk inadvertently discarding semantically important tokens, especially under high compression ratios. In this paper, we introduce GreedyPrune, a training-free plug-and-play visual token pruning algorithm designed to jointly optimize semantic saliency and visual…
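The abstract is cut off before it describes the algorithm itself, so the following is only a minimal sketch of the general idea of trading saliency against diversity in a greedy loop, not the authors' method. It assumes each visual token has an embedding and a scalar saliency score (for example, attention mass); the function name `greedy_select`, the weight `alpha`, and the scoring rule (normalized saliency plus one minus the maximum cosine similarity to already-kept tokens) are all hypothetical choices made for illustration.

```python
import numpy as np

def greedy_select(features, saliency, keep, alpha=0.5):
    """Greedily pick `keep` token indices, trading saliency against diversity.

    Illustrative only: the scoring rule below is an assumption, not the
    formulation used in the GreedyPrune paper.

    features: (n, d) array of visual token embeddings.
    saliency: (n,) array of per-token importance scores.
    alpha:    hypothetical weight; 1.0 = saliency only, 0.0 = diversity only.
    """
    features = np.asarray(features, dtype=float)
    saliency = np.asarray(saliency, dtype=float)
    n = features.shape[0]
    keep = min(keep, n)
    if keep <= 0:
        return []

    # Pairwise cosine similarity between tokens.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    sim = unit @ unit.T

    # Min-max normalize saliency to [0, 1] so both terms share a scale.
    span = saliency.max() - saliency.min()
    sal = (saliency - saliency.min()) / span if span > 0 else np.zeros(n)

    # Seed with the most salient token.
    selected = [int(np.argmax(sal))]
    available = np.ones(n, dtype=bool)
    available[selected[0]] = False
    # max_sim[i] = highest similarity of token i to any kept token so far.
    max_sim = sim[selected[0]].copy()

    while len(selected) < keep:
        # Reward saliency and dissimilarity to the tokens already kept.
        score = alpha * sal + (1.0 - alpha) * (1.0 - max_sim)
        score[~available] = -np.inf
        best = int(np.argmax(score))
        selected.append(best)
        available[best] = False
        max_sim = np.maximum(max_sim, sim[best])

    return selected


# Toy example: tokens 0 and 1 are near-duplicates, token 2 is distinct.
feats = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.9, 0.1]])
sal = np.array([1.0, 0.9, 0.2, 0.5])
print(greedy_select(feats, sal, keep=2, alpha=0.5))  # [0, 2]
print(greedy_select(feats, sal, keep=2, alpha=1.0))  # [0, 1]
```

In the toy example, a pure saliency ranking (`alpha=1.0`) keeps two near-duplicate tokens, while the mixed score keeps the most salient token plus the visually distinct one. This mirrors the two failure modes the abstract attributes to saliency-only and diversity-only pruning.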
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
Methods: Pruning · Focus
