Rethinking Visual Token Reduction in LVLMs Under Cross-Modal Misalignment
Rui Xu, Yunke Wang, Yong Luo, Bo Du

TL;DR
This paper introduces VisionDrop, a training-free visual token pruning method for LVLMs that improves efficiency by selecting informative visual tokens based solely on visual attention, addressing cross-modal misalignment issues.
Contribution
The paper proposes a novel, training-free visual-only pruning framework that enhances token reduction in LVLMs without relying on textual signals or additional training.
Findings
VisionDrop achieves a 2.7x reduction in inference latency.
It reduces FLOPs by 6x while retaining 95.71% of the original model's performance.
The method outperforms existing token reduction approaches across benchmarks.
Abstract
Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial computational overhead and limiting the scalability of LVLMs in practice. Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs). However, most in-LLM reduction approaches rely on text-conditioned interactions, implicitly assuming that textual tokens can reliably capture the importance of visual tokens. In this work, we revisit this assumption and reveal causal, semantic, and spatial forms of cross-modal misalignment. These misalignments undermine the effectiveness of text-guided visual token reduction. To address this, we introduce VisionDrop, a training-free, visual-only pruning framework that…
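The idea of visual-only pruning can be illustrated with a minimal sketch. The abstract is truncated, so the exact scoring rule used by VisionDrop is not given here; the code below assumes one plausible rule, scoring each visual token by the attention it receives from other visual tokens and keeping the top fraction. The function name `prune_visual_tokens`, the `keep_ratio` parameter, and the averaging scheme are all hypothetical and chosen for illustration only. No textual tokens enter the computation.

```python
import numpy as np

def prune_visual_tokens(tokens, attn, keep_ratio=0.25):
    """Keep the visual tokens that receive the most visual-to-visual attention.

    tokens: (N, D) array of visual token embeddings.
    attn:   (H, N, N) array of attention weights among visual tokens only,
            indexed as [head, query, key].
    Returns the kept tokens and their original indices.

    NOTE: this scoring rule is an assumption for illustration, not the
    paper's confirmed method.
    """
    n = tokens.shape[0]
    k = max(1, int(round(n * keep_ratio)))
    # Importance of each token = attention it receives as a key,
    # averaged over heads and over all query positions.
    scores = attn.mean(axis=0).mean(axis=0)  # shape (N,)
    # Pick the k highest-scoring tokens, then sort the indices so the
    # surviving tokens keep their original spatial order.
    top = np.argsort(-scores, kind="stable")[:k]
    keep = np.sort(top)
    return tokens[keep], keep
```

Because the score depends only on attention among visual tokens, the selection is unaffected by how the text prompt attends to the image, which is the property the abstract motivates.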
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
