When Token Pruning is Worse than Random: Understanding Visual Token Information in VLLMs

Yahong Wang; Juncheng Wu; Zhangkai Ni; Longzhen Yang; Yihang Liu; Chengmei Yang; Ying Wen; Lianghua He; Xianfeng Tang; Hui Liu; Yuyin Zhou

arXiv:2512.07580 · cs.CV · March 10, 2026

TL;DR

This paper investigates why existing token pruning methods for Vision Large Language Models (VLLMs) perform no better than random pruning in deep layers. It identifies an 'information horizon' beyond which visual tokens lose their salience, and proposes simple random pruning as an effective solution.

Contribution

The study introduces a new metric to quantify token information, identifies the 'information horizon' phenomenon, and demonstrates that random pruning can outperform existing methods in deep layers.

Findings

01. Tokens lose information with depth, vanishing beyond the 'information horizon'

02. Deeper horizons occur in visually intensive tasks like OCR

03. Random pruning maintains performance while reducing tokens by 50%
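
The random pruning mentioned in the third finding can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_prune`, the `keep_ratio` parameter, and the token count of 576 are hypothetical choices made for illustration. The sketch only selects which visual token positions survive; applying the selection inside a real model is left out.

```python
import random

def random_prune(num_visual_tokens, keep_ratio=0.5, seed=None):
    """Pick a uniform random subset of visual token indices to keep.

    Hypothetical helper for illustration; returns indices in their
    original order so positional structure is preserved.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if num_visual_tokens <= 0:
        return []
    rng = random.Random(seed)
    num_keep = max(1, round(num_visual_tokens * keep_ratio))
    kept = rng.sample(range(num_visual_tokens), num_keep)
    return sorted(kept)

# Assumed example: 576 visual tokens, keep half of them.
kept = random_prune(576, keep_ratio=0.5, seed=0)
print(len(kept))  # 288
```

Sorting the kept indices is a design choice in this sketch: it keeps the surviving tokens in their original sequence order, which a downstream model would typically expect.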

Abstract

Vision Large Language Models (VLLMs) incur high computational costs due to their reliance on hundreds of visual tokens to represent images. While token pruning offers a promising solution for accelerating inference, this paper identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by "vanishing token information", where visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token's information content by measuring the change in the model output probabilities upon its removal. Using this proposed metric, our analysis of the information of visual tokens across layers reveals three key findings: (1) As layers deepen, the information of visual tokens gradually…
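
The removal-based metric described in the abstract can be sketched as follows. The abstract does not state how the change in output probabilities is measured, so this sketch assumes total variation distance; `token_information`, `toy_model`, and the toy token vectors are all hypothetical stand-ins, not the paper's definitions or a real VLLM.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_information(model_fn, tokens, index):
    """Score one token by how much the output distribution shifts
    when that token is removed.

    Assumption: the shift is measured with total variation distance;
    the source only says "change in the model output probabilities".
    """
    full = softmax(model_fn(tokens))
    ablated = softmax(model_fn(tokens[:index] + tokens[index + 1:]))
    return 0.5 * sum(abs(p - q) for p, q in zip(full, ablated))

def toy_model(tokens):
    """Hypothetical stand-in for a model: logits are the sum of token vectors."""
    dim = 3
    return [sum(t[d] for t in tokens) for d in range(dim)]

tokens = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
scores = [token_information(toy_model, tokens, i) for i in range(len(tokens))]
print(scores)
```

In this toy setup the all-zero token contributes nothing to the logits, so removing it leaves the output unchanged and its score is zero, which mirrors the idea of a token that carries no information.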

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications