IVC-Prune: Revealing the Implicit Visual Coordinates in LVLMs for Vision Token Pruning
Zhichao Sun, Yidong Ma, Gang Liu, Yibo Chen, Xu Tang, Yao Hu, Yongchao Xu

TL;DR
This paper reveals that LVLMs implicitly establish visual coordinate systems via Rotary Position Embeddings, and introduces IVC-Prune, a training-free token pruning method that preserves spatial reasoning tokens, reducing computational cost while maintaining performance.
Contribution
The paper uncovers the implicit visual coordinate system in LVLMs and proposes IVC-Prune, a novel token pruning strategy that retains spatial reasoning tokens without additional training.
Findings
Reduces visual tokens by ~50% while maintaining ≥99% performance.
Identifies IVC tokens through mathematical analysis of RoPE.
Achieves performance improvements on several benchmarks.
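To make the ~50% budget concrete, here is a minimal sketch of a budgeted selection step. It is not the paper's implementation: the function name `select_tokens`, the inputs `relevance` and `ivc_indices`, and the rule of keeping IVC tokens first and filling the rest by relevance are all assumptions made for illustration. This summary does not state how the paper identifies IVC tokens or splits the budget between the two groups.

```python
def select_tokens(relevance, ivc_indices, keep_ratio=0.5):
    """Return sorted indices of visual tokens to keep.

    relevance   : per-token relevance scores (hypothetical input)
    ivc_indices : positions assumed to be IVC tokens (hypothetical input)
    keep_ratio  : fraction of visual tokens to retain overall
    """
    n = len(relevance)
    budget = max(1, int(n * keep_ratio))
    # Assumption: IVC tokens are kept first, capped at the budget.
    kept = sorted(set(ivc_indices))[:budget]
    kept_set = set(kept)
    # Fill the remaining budget with the highest-relevance tokens.
    rest = sorted((i for i in range(n) if i not in kept_set),
                  key=lambda i: relevance[i], reverse=True)
    kept_set.update(rest[:budget - len(kept)])
    return sorted(kept_set)


scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.7, 0.15]
kept = select_tokens(scores, ivc_indices=[0, 4], keep_ratio=0.5)
print(kept)  # [0, 1, 3, 4]
```

Tokens 0 and 4 have low relevance scores but survive because they are treated as IVC tokens; a purely relevance-based selector would have dropped them.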
Abstract
Large Vision-Language Models (LVLMs) achieve impressive performance across multiple tasks. A significant challenge, however, is their prohibitive inference cost when processing high-resolution visual inputs. While visual token pruning has emerged as a promising solution, existing methods that primarily focus on semantic relevance often discard tokens that are crucial for spatial reasoning. We address this gap through a novel insight into how LVLMs process spatial reasoning. Specifically, we reveal that LVLMs implicitly establish visual coordinate systems through Rotary Position Embeddings (RoPE), where specific token positions serve as implicit visual coordinates (IVC tokens) that are essential for spatial reasoning. Based on this insight, we propose IVC-Prune, a training-free, prompt-aware pruning strategy that retains both IVC tokens and semantically relevant…
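As background for the RoPE claim, the sketch below illustrates the standard property that RoPE attention scores depend only on the relative offset between query and key positions. It uses a single 2-D frequency pair with an arbitrary angle `theta=0.5`; the helper names `rope_rotate` and `score` are chosen here for illustration. It does not reproduce the paper's analysis of which positions act as IVC tokens.

```python
import math


def rope_rotate(vec, pos, theta=0.5):
    """Rotate a 2-D vector by angle pos * theta (one RoPE frequency pair)."""
    x, y = vec
    a = pos * theta
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))


def score(q, k, m, n):
    """Dot product of a query at position m with a key at position n."""
    qm = rope_rotate(q, m)
    kn = rope_rotate(k, n)
    return qm[0] * kn[0] + qm[1] * kn[1]


q, k = (1.0, 2.0), (0.5, -1.0)
# Same offset (m - n = 3) at different absolute positions.
same_offset_a = score(q, k, 5, 2)
same_offset_b = score(q, k, 13, 10)
print(math.isclose(same_offset_a, same_offset_b))  # True
```

Because both vectors are rotated, the absolute positions cancel in the dot product and only the difference m - n remains.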
Peer Reviews
Decision·ICLR 2026 Poster
1. The paper introduces a new perspective on token pruning by focusing on the IVC tokens for spatial reasoning in LVLMs. The idea of implicit visual coordinates derived from RoPE is novel. 2. The experimental results are impressive. The performance can approach or even surpass the vanilla model under 50% compression.
1. The paper lacks comparison with more recent baselines. FastV is already a weaker baseline, and it would be beneficial to include a comparison with SparseVLM. 2. The paper does not clarify how IVC tokens and foreground tokens should be allocated under a 50% total budget. Ablation experiments should be included to investigate this.
Unlike most previous methods in this area, this paper actually conducted a nice theoretical analysis of the working mechanism in VLMs, and the strong empirical results further supported the analysis. The pruning strategy is well designed: once the tokens for pruning are decided in the first forward pass, the KV cache of all layers can be cleaned to maximize the saving in compute and memory. The experiments with 4 different VLMs with different architectures, image handling strategies all show very
I don’t have any major concerns. One minor issue is that the method depends on the properties of RoPE, so its generalizability to other model architectures with different position embeddings is unknown.
1. The proposed approach seems to be simple yet effective and can be applied to LVLMs with different architectures. 2. Extensive experimental results validate the effectiveness of the proposed approach. 3. The paper is generally well-written and the structure is clear.
1. While the empirical experiments demonstrate the significant impact of the IVC tokens, I would be more convinced if a more detailed analysis were provided to explain *why* these tokens are important. 2. Relevant baselines employing window-based token selection approaches should be included in the main experiments, as they also aim to preserve spatial information along with the foreground tokens. Additionally, a discussion on novelty is needed to better differentiate this work from those relate
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Advanced Image and Video Retrieval Techniques
