Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang

TL;DR
This paper introduces VisPruner, a token pruning method for vision-language models that relies on visual cues rather than text-visual attention scores, substantially reducing computation while maintaining comparable performance.
Contribution
VisPruner is a plug-and-play approach that prunes visual tokens in two steps: it selects a small set of significant tokens using visual attention, then removes duplicates among the remaining tokens based on their similarity. It outperforms methods that rely on text-visual attention.
Findings
Reduces FLOPs of LLaVA-1.5-7B by 91% without training.
Decreases inference latency by 75%.
Maintains comparable performance across various architectures.
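To put the percentages above in perspective, a reduction can be converted into a speedup factor with simple arithmetic. The helper below is only an illustration (the name `reduction_to_factor` is ours, not from the paper); it assumes each percentage is measured against the unpruned model.

```python
def reduction_to_factor(reduction_pct: float) -> float:
    """Convert a percentage reduction into an 'N times less' factor."""
    remaining = 1.0 - reduction_pct / 100.0  # fraction of the cost that is left
    if remaining <= 0.0:
        raise ValueError("reduction must be below 100%")
    return 1.0 / remaining


# 91% fewer FLOPs leaves 9% of the compute, roughly 11x less work.
print(f"FLOPs:   {reduction_to_factor(91):.1f}x")
# 75% lower latency leaves 25% of the time, a 4x speedup.
print(f"Latency: {reduction_to_factor(75):.1f}x")
```

The gap between the two factors is expected in general: FLOPs count only arithmetic, while measured latency also includes costs that pruning tokens does not remove.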
Abstract
Large vision-language models (LVLMs) generally contain significantly more visual tokens than their textual counterparts, resulting in a considerable computational burden. Recent efforts have been made to tackle this issue by pruning visual tokens early within the language model. Most existing works use attention scores between text and visual tokens to assess the importance of visual tokens. However, in this study, we first analyze the text-visual attention in the language model and find that this score is not an ideal indicator for token pruning. Based on the analysis, we propose VisPruner, a plug-and-play method that utilizes visual cues for more effective token pruning in LVLMs. Specifically, we first use visual attention to select a limited number of significant tokens. Then, we remove duplicate tokens from the remaining ones based on their similarity. By retaining diverse tokens…
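The two steps described in the abstract (select significant tokens by visual attention, then remove duplicates among the rest by similarity) can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function name `prune_tokens`, the `important_ratio` parameter, the choice of cosine similarity, and the greedy one-at-a-time removal rule are all assumptions made for the example, and the toy features and attention scores are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def prune_tokens(features, attn, keep, important_ratio=0.5):
    """Return the sorted indices of `keep` visual tokens to retain.

    features: one feature vector per visual token
    attn: one visual-attention score per token (assumed to come from
          the visual encoder, not from text-visual attention)
    important_ratio: hypothetical split of the budget between the
          "significant" and the "diverse" tokens
    """
    n = len(features)
    if len(attn) != n or not 0 < keep <= n:
        raise ValueError("need len(attn) == len(features) and 0 < keep <= n")

    # Step 1: keep the tokens with the highest visual attention.
    order = sorted(range(n), key=lambda i: attn[i], reverse=True)
    n_important = int(keep * important_ratio)
    important = order[:n_important]
    rest = order[n_important:]  # still sorted by attention, high to low

    # Step 2: among the remaining tokens, repeatedly drop the one that is
    # most similar to some other remaining token (a near-duplicate).
    n_diverse = keep - n_important
    while len(rest) > n_diverse:
        def redundancy(i):
            return max(
                (cosine(features[i], features[j]) for j in rest if j != i),
                default=0.0,
            )
        # Scanning from lowest attention first breaks ties in favour of
        # keeping the higher-attention token.
        rest.remove(max(reversed(rest), key=redundancy))

    return sorted(important + rest)


features = [
    [1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
    [0.0, 1.0],   # exact duplicate of token 2
    [1.0, 1.0], [-1.0, 0.5],
]
attn = [0.40, 0.30, 0.10, 0.05, 0.08, 0.07]
print(prune_tokens(features, attn, keep=4))  # [0, 1, 2, 5]
```

In this toy run, tokens 0 and 1 are kept for their high attention, the duplicate token 3 is dropped first, and token 4 is dropped next because it overlaps with token 2, leaving tokens 2 and 5 as the diverse set.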
Taxonomy
Topics: Retinal Imaging and Analysis · Image Processing Techniques and Applications · Advanced Vision and Imaging
Methods: Softmax · Attention Is All You Need · Pruning
