Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity

Zhengyao Fang; Pengyuan Lyu; Chengquan Zhang; Guangming Lu; Jun Yu; Wenjie Pei

arXiv:2603.09480·cs.CV·March 12, 2026

Open Access·3 Reviews

TL;DR

PruneSID introduces a training-free, two-stage token compression method for vision-language models that balances importance and diversity, significantly reducing tokens while maintaining high accuracy across image and video tasks.

Contribution

PruneSID is a pruning framework that jointly accounts for token importance and diversity and adds a dynamic compression mechanism, achieving state-of-the-art results without additional training.

Findings

01

Achieves 96.3% accuracy on LLaVA-1.5 while retaining only 11.1% of the visual tokens.

02

Outperforms prior methods by 2.5% accuracy at extreme compression.

03

Provides 7.8× faster prefilling speed compared to original models.
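To make the 11.1% figure concrete, here is a small worked calculation. It assumes LLaVA-1.5 encodes each image as a 24 × 24 grid of 576 visual tokens; that count and the helper name `retained_tokens` are assumptions for illustration and do not come from this page.

```python
def retained_tokens(total_tokens: int, keep_ratio: float) -> int:
    """Number of visual tokens left after keeping `keep_ratio` of them."""
    return round(total_tokens * keep_ratio)


total = 576  # assumption: 24 x 24 patch grid per image
kept = retained_tokens(total, 0.111)
print(kept, f"{kept / total:.1%}")  # prints "64 11.1%"
```

Under that assumption, the reported setting corresponds to a budget of roughly 64 visual tokens per image.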

Abstract

Vision-language models (VLMs) face significant computational inefficiencies caused by excessive generation of visual tokens. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance importance preservation and information diversity. To address this, we propose PruneSID, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Components Analysis (PSCA) for clustering tokens into semantically coherent groups, ensuring comprehensive concept coverage, and (2) Intra-group Non-Maximum Suppression (NMS) for pruning redundant tokens while preserving key representative tokens within each group. Additionally, PruneSID incorporates an information-aware dynamic compression ratio mechanism that optimizes token compression rates based on image complexity, enabling more…
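The two-stage idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes that grouping is done by assigning each token to the principal direction it projects onto most strongly, that the projection magnitude serves as the importance score, and that redundancy is measured by cosine similarity. The function names, the threshold, and the fixed per-group budget are all hypothetical, and the dynamic compression ratio is not modeled.

```python
import numpy as np


def group_tokens(tokens: np.ndarray, num_groups: int):
    """Stage 1 (assumed form): group tokens by their dominant principal direction."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = np.abs(centered @ vt[:num_groups].T)  # (num_tokens, num_groups)
    # Group id and an assumed importance score (projection magnitude).
    return proj.argmax(axis=1), proj.max(axis=1)


def intra_group_nms(unit, order, sim_threshold, max_keep):
    """Stage 2: greedy NMS over `order`; skip tokens too similar to a kept one."""
    kept = []
    for idx in order:
        if len(kept) >= max_keep:
            break
        if all(unit[idx] @ unit[k] <= sim_threshold for k in kept):
            kept.append(int(idx))
    return kept


def prune_tokens(tokens, num_groups=8, keep_per_group=2, sim_threshold=0.8):
    """Return sorted indices of the tokens that survive both stages."""
    group_ids, scores = group_tokens(tokens, num_groups)
    # Unit-normalize once so dot products are cosine similarities.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    selected = []
    for g in range(num_groups):
        members = np.flatnonzero(group_ids == g)
        order = members[np.argsort(-scores[members])]  # most important first
        selected.extend(intra_group_nms(unit, order, sim_threshold, keep_per_group))
    return np.array(sorted(selected), dtype=int)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 64))  # stand-in for visual token features
kept = prune_tokens(tokens)
print(len(kept), "of", len(tokens), "tokens kept")
```

Grouping first and suppressing near-duplicates only within each group is what lets the sketch keep at least one representative per semantic group (diversity) while still preferring the highest-scoring tokens (importance).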

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

This paper makes a solid and well-executed contribution to efficient vision-language modeling. Its originality lies in the creative combination of semantic grouping and redundancy pruning within a training-free framework, offering a fresh perspective on token compression. The experimental work is thorough and convincing, with strong empirical support across multiple models and benchmarks. The writing is generally clear and the presentation effective, though some technical sections are dense. Ove

Weaknesses

While the paper is strong overall, several weaknesses limit its impact and generality. First, the conceptual novelty is somewhat incremental. Second, the theoretical grounding is relatively weak: PruneSID is motivated intuitively but lacks a formal analysis of why PSCA and NMS together should optimally balance semantic importance and diversity.

Reviewer 02·Rating 6·Confidence 4

Strengths

1. The writing is fluent and overall clear. 2. The performance is well validated on both **LLaVA-1.5** and **LLaVA-NeXT**.

Weaknesses

1. The performance on LLaVA-1.5 and LLaVA-NeXT appears relatively weak; please provide comparisons on Qwen2.5-VL instead. 2. Your method performs compression mainly after the vision encoder. I’m curious about how it would behave when combined with compression techniques applied during the LLM stage, such as PyramidDrop, which already adopts a multi-stage framework. Could such a combination achieve a more extreme level of compression by eliminating redundancy more thoroughly at each stage? Furth

Reviewer 03·Rating 4·Confidence 4

Strengths

1. The proposed method is training-free and can be seamlessly integrated into existing vision–language models. This makes it practical and easily deployable in real-time inference scenarios. 2. The two-stage design, PSCA for semantic grouping and intra-group NMS for redundancy suppression, offers a clean and interpretable way to capture both salient and diverse information. This structure is distinct from previous one-sided importance-only or diversity-only methods. 3. The paper is well-organized, with

Weaknesses

1. Generalization yet to be verified: the paper lacks experiments on different models and parameter counts; effectiveness on other architectures (e.g., LLaVA-OV [1], InstructBLIP [2], Qwen-VL [3]) and other model sizes (e.g., LLaVA-13B) remains to be validated. 2. Baseline selection is insufficient: the comparison with existing methods is not entirely up to date, so more methods should be compared, for example VisPruner [4] and CDPruner [5]. [1] Bo Li, Yuanhan Zhan

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Advanced Data Compression Techniques