ZSPAPrune: Zero-Shot Prompt-Aware Token Pruning for Vision-Language Models

Pu Zhang; Yuwei Li; Xingyuan Xian; Guoming Tang

arXiv:2510.17197·cs.CV·October 21, 2025

Open Access · 3 Reviews

TL;DR

ZSPAPrune introduces a zero-shot, prompt-aware token pruning method for vision-language models that effectively reduces computational costs while maintaining high accuracy by balancing task relevance and diversity.

Contribution

It presents a novel prompt-aware token pruning approach that explicitly models task relevance, outperforming existing methods in efficiency and accuracy preservation.

Findings

01. Achieves up to 90% token pruning with minimal accuracy loss.
02. Reduces GPU memory and inference latency significantly.
03. Matches or surpasses state-of-the-art performance on multiple benchmarks.

Abstract

As the capabilities of Vision-Language Models (VLMs) advance, they can process increasingly large inputs, which, unlike in LLMs, generates significant visual token redundancy and leads to prohibitive inference costs. While many methods aim to reduce these costs by pruning visual tokens, existing approaches, whether based on attention or diversity, typically neglect the guidance of the text prompt and thus fail to prioritize task relevance. In this work, we propose a novel, zero-shot method that reframes the problem by introducing a prompt-aware perspective, explicitly modeling visual token pruning as a balance between task relevance and information diversity. Our hierarchical approach first selects a core set of task-relevant visual tokens and then supplements them with diversity tokens to preserve broader context. Experiments across multiple models and benchmarks show that our method…
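The two-stage idea in the abstract (a relevance core, then diversity tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `prune_visual_tokens`, the `relevance_share` parameter, mean-pooling of the prompt embeddings, cosine similarity as the relevance score, and greedy farthest-point selection for diversity are all assumptions made for illustration.

```python
import numpy as np

def prune_visual_tokens(visual, prompt, keep, relevance_share=0.5):
    """Hypothetical two-stage selection; returns sorted indices of kept tokens.

    visual: (N, d) visual token embeddings
    prompt: (T, d) prompt token embeddings in the same space (assumed)
    keep:   number of visual tokens to retain
    relevance_share: assumed fraction of the budget given to relevance
    """
    n = visual.shape[0]
    if keep < 1:
        raise ValueError("keep must be at least 1")
    keep = min(keep, n)

    # Unit-normalise so dot products are cosine similarities.
    v = visual / (np.linalg.norm(visual, axis=1, keepdims=True) + 1e-8)
    q = prompt.mean(axis=0)  # assumption: mean-pool the prompt into one vector
    q = q / (np.linalg.norm(q) + 1e-8)

    # Stage 1: core set of the tokens most similar to the pooled prompt.
    n_rel = min(keep, max(1, int(round(keep * relevance_share))))
    relevance = v @ q
    selected = [int(i) for i in np.argsort(-relevance)[:n_rel]]

    # Stage 2: repeatedly add the token least similar to anything selected.
    max_sim = (v @ v[selected].T).max(axis=1)
    max_sim[selected] = np.inf  # never re-pick a selected token
    while len(selected) < keep:
        nxt = int(np.argmin(max_sim))
        selected.append(nxt)
        max_sim = np.maximum(max_sim, v @ v[nxt])
        max_sim[nxt] = np.inf

    return np.sort(np.array(selected))

# Toy example: tokens 0 and 1 are near-duplicates aligned with the prompt.
toy_visual = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
toy_prompt = np.array([[1.0, 0.0]])
toy_kept = prune_visual_tokens(toy_visual, toy_prompt, keep=2)
print(toy_kept)  # [0 2]
```

In the toy example, ranking by relevance alone would keep the two near-duplicate tokens 0 and 1. The diversity stage instead spends the second slot on token 2, which is the kind of broader context the abstract says the diversity tokens are meant to preserve.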

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

The strengths are as follows: 1. The paper is easy to read and the method is easy to follow. 2. The evaluated datasets and VLMs are diverse.

Weaknesses

The weaknesses are as follows: 1. There are many existing works on task relevance in token pruning for VLMs. This work additionally considers information diversity, which seems to be an incremental novelty. Meanwhile, from Figure 1 it is not easy to understand why information diversity is useful for the token pruning task. 2. Missing related work. There are many other recent token pruning methods [1,2,3,4] that are not analyzed or discussed in this work. These works should also be added for co

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Takes a prompt-aware view of token selection that balances task relevance and information diversity in visual representations. 2. Introduces a hierarchical pruning mechanism composed of Prompt Simplification, Prompt-Aware Selection, and Diversity Balance to achieve controllable token reduction. 3. Achieves significant inference efficiency improvements with minimal accuracy loss under zero-shot settings across multiple Vision-Language Models and benchmarks.

Weaknesses

1. The paper lacks comparison with other methods that explicitly address the trade-off between task relevance and information diversity. Without such comparison, it remains unclear whether the proposed balance strategy is superior or merely heuristic. 2. As a plug-and-play method, ZSPAPrune should be validated on more models with different parameter scales to confirm its general applicability. The current experiments are limited to a narrow range of architectures, reducing the evidence of scalab

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The paper presents a clear, zero-shot pruning method that balances prompt relevance and visual diversity, which prior work did not. Experiments across strong VLMs and multiple benchmarks show it maintains or improves accuracy under extreme pruning while reducing cost. The method is practically significant because it can be dropped into existing VLMs without any retraining or architectural changes.

Weaknesses

The paper does not report direct quantitative comparisons against strong prompt-aware pruning baselines (e.g., GlimpsePrune), so it is hard to verify that the proposed approach is actually better than the closest prior work. The efficiency claims are based on a single model/setting and only at an extreme 90% pruning ratio, with limited analysis of where latency and memory savings come from or how they scale with pruning level. The method is essentially heuristic and lacks a clear formal objec

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications