ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models
Pu Zhang, Yuwei Li, Xingyuan Xian, Guoming Tang

TL;DR
ZSPAPrune introduces a zero-shot, prompt-aware token pruning method for vision-language models that effectively reduces computational costs while maintaining high accuracy by balancing task relevance and diversity.
Contribution
It presents a novel prompt-aware token pruning approach that explicitly models task relevance, outperforming existing methods in efficiency and accuracy preservation.
Findings
Achieves up to 90% token pruning with minimal accuracy loss.
Reduces GPU memory and inference latency significantly.
Matches or surpasses state-of-the-art performance on multiple benchmarks.
Abstract
As the capabilities of Vision-Language Models (VLMs) advance, they can process increasingly large inputs, which, unlike in LLMs, generates significant visual token redundancy and leads to prohibitive inference costs. While many methods aim to reduce these costs by pruning visual tokens, existing approaches, whether based on attention or diversity, typically neglect the guidance of the text prompt and thus fail to prioritize task relevance. In this work, we propose a novel, zero-shot method that reframes the problem by introducing a prompt-aware perspective, explicitly modeling visual token pruning as a balance between task relevance and information diversity. Our hierarchical approach first selects a core set of task-relevant visual tokens and then supplements them with diversity tokens to preserve broader context. Experiments across multiple models and benchmarks show that our method…
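The two-stage selection described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `prune_visual_tokens`, the `relevance_share` knob, mean-pooling of the prompt, cosine similarity as the relevance score, and greedy farthest-point selection for diversity are all hypothetical choices made for the example. The tensor sizes in the usage lines are arbitrary.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def prune_visual_tokens(visual, prompt, keep, relevance_share=0.5):
    """Return sorted indices of the visual tokens to keep.

    visual: (N, d) visual token embeddings
    prompt: (M, d) prompt token embeddings
    keep: number of visual tokens to retain
    relevance_share: hypothetical knob, fraction of the budget spent on relevance
    """
    if not 1 <= keep <= len(visual):
        raise ValueError("keep must be between 1 and the number of visual tokens")

    v = l2_normalize(visual)
    # Assumption: summarize the prompt by mean-pooling its token embeddings.
    p = l2_normalize(prompt.mean(axis=0))

    # Stage 1: core set = tokens most similar to the prompt summary.
    n_core = min(keep, max(1, int(round(keep * relevance_share))))
    relevance = v @ p
    selected = [int(i) for i in np.argsort(-relevance)[:n_core]]

    # Stage 2: greedily add the token least similar to anything already selected.
    max_sim = (v @ v[selected].T).max(axis=1)
    max_sim[selected] = np.inf  # selected tokens can never be picked again
    while len(selected) < keep:
        nxt = int(np.argmin(max_sim))
        selected.append(nxt)
        max_sim = np.maximum(max_sim, v @ v[nxt])
        max_sim[nxt] = np.inf
    return np.sort(np.array(selected))


# Usage with random stand-in embeddings (sizes are arbitrary).
rng = np.random.default_rng(0)
visual = rng.normal(size=(576, 64))
prompt = rng.normal(size=(8, 64))
kept = prune_visual_tokens(visual, prompt, keep=58)  # roughly 90% pruned
pruned = visual[kept]
```

Returning sorted indices keeps the surviving tokens in their original spatial order, which matters if the downstream model relies on positional information. Raising `relevance_share` toward 1 spends more of the budget on prompt-relevant tokens, while lowering it favors broader context.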
Peer Reviews
Decision · Submitted to ICLR 2026
The strengths are as follows: 1. The paper is easy to read and the method is easy to follow. 2. The evaluated datasets and VLMs are diverse.
The weaknesses are as follows: 1. There are many existing works on task relevance in token pruning for VLMs. This work additionally considers information diversity, which seems to be an incremental novelty. Meanwhile, in Figure 1, it is not easy to understand why information diversity is useful for the token pruning task. 2. Missing related works. Recently, there have been many other token pruning methods [1,2,3,4], which are not analyzed or discussed in this work. These works should also be added for co
1. Adopting a prompt-aware token selection perspective to balance task relevance and information diversity in visual representations. 2. Introducing a hierarchical pruning mechanism composed of Prompt Simplification, Prompt-Aware Selection, and Diversity Balance to achieve controllable token reduction. 3. Achieving significant inference efficiency improvements with minimal accuracy loss under zero-shot settings across multiple Vision-Language Models and benchmarks.
1. The paper lacks comparison with other methods that explicitly address the trade-off between task relevance and information diversity. Without such comparison, it remains unclear whether the proposed balance strategy is superior or merely heuristic. 2. As a plug-and-play method, ZSPAPrune should be validated on more models with different parameter scales to confirm its general applicability. The current experiments are limited to a narrow range of architectures, reducing the evidence of scalab
The paper presents a clear, zero-shot pruning method that balances prompt relevance and visual diversity, which prior work did not. Experiments across strong VLMs and multiple benchmarks show it maintains or improves accuracy under extreme pruning while reducing cost. The method is practically significant because it can be dropped into existing VLMs without any retraining or architectural changes.
The paper does not report direct quantitative comparisons against strong prompt-aware pruning baselines (e.g., GlimpsePrune), so it is hard to verify that the proposed approach is actually better than the closest prior work. The efficiency claims are based on a single model/setting and only at an extreme 90% pruning ratio, with limited analysis of where latency and memory savings come from or how they scale with pruning level. The method is essentially heuristic and lacks a clear formal objec
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
