Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

Qizhe Zhang; Aosong Cheng; Ming Lu; Renrui Zhang; Zhiyong Zhuo; Jiajun Cao; Shaobo Guo; Qi She; Shanghang Zhang

arXiv:2412.01818 · cs.CV · May 13, 2025 · 3 cites

TL;DR

This paper introduces VisPruner, a token pruning method for vision-language models that relies on visual cues rather than text-visual attention scores, substantially reducing computation while maintaining comparable performance.

Contribution

VisPruner is a new plug-and-play approach that effectively prunes visual tokens by selecting significant ones based on visual cues and removing duplicates, outperforming attention-based methods.

Findings

01. Reduces FLOPs of LLaVA-1.5-7B by 91% without training.

02. Decreases inference latency by 75%.

03. Maintains comparable performance across various architectures.

Abstract

Large vision-language models (LVLMs) generally contain significantly more visual tokens than their textual counterparts, resulting in a considerable computational burden. Recent efforts have been made to tackle this issue by pruning visual tokens early within the language model. Most existing works use attention scores between text and visual tokens to assess the importance of visual tokens. However, in this study, we first analyze the text-visual attention in the language model and find that this score is not an ideal indicator for token pruning. Based on the analysis, we propose VisPruner, a plug-and-play method that utilizes visual cues for more effective token pruning in LVLMs. Specifically, we first use visual attention to select a limited number of significant tokens. Then, we remove duplicate tokens from the remaining ones based on their similarity. By retaining diverse tokens…
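The two steps named in the abstract (select significant tokens by visual attention, then remove duplicates among the rest by similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `prune_visual_tokens`, its parameters, the use of cosine similarity, and the greedy removal loop are all assumptions chosen for illustration, and the abstract does not say how the attention scores are obtained. The loop is cubic in the number of tokens, so it is only suitable for small examples.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def prune_visual_tokens(features, attention, num_important, num_diverse):
    """Return sorted indices of the visual tokens to keep (hypothetical helper).

    features:      list of token feature vectors (assumed input)
    attention:     one visual attention score per token (assumed input)
    num_important: tokens kept for their high attention (step 1)
    num_diverse:   tokens kept from the rest after de-duplication (step 2)
    """
    # Step 1: keep the tokens with the highest visual attention.
    order = sorted(range(len(features)), key=lambda i: attention[i], reverse=True)
    important = order[:num_important]
    remaining = order[num_important:]

    # Step 2 (assumed greedy rule): repeatedly drop the remaining token that
    # is most similar to another remaining token. Scanning from the lowest
    # attention upward means ties are resolved against the lower-attention token.
    while len(remaining) > num_diverse:
        worst, worst_sim = None, -2.0
        for i in reversed(remaining):
            sim = max(
                (cosine_similarity(features[i], features[j])
                 for j in remaining if j != i),
                default=-1.0,
            )
            if sim > worst_sim:
                worst, worst_sim = i, sim
        remaining.remove(worst)

    return sorted(important + remaining)


# Toy data: token 2 is a near-duplicate of token 1.
features = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.99], [-1.0, 0.0], [0.0, -1.0]]
attention = [0.9, 0.3, 0.2, 0.1, 0.05]

kept = prune_visual_tokens(features, attention, num_important=1, num_diverse=3)
print(kept)  # [0, 1, 3, 4]
```

In this toy run, token 0 is kept for its high attention, and token 2 is dropped because it nearly duplicates token 1. Splitting the budget into an attention-based part and a diversity-based part is the design idea the abstract describes; the specific split used here is arbitrary.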

Taxonomy

Topics: Retinal Imaging and Analysis · Image Processing Techniques and Applications · Advanced Vision and Imaging

Methods: Softmax · Attention Is All You Need · Pruning