Rethinking Visual Token Reduction in LVLMs Under Cross-Modal Misalignment

Rui Xu; Yunke Wang; Yong Luo; Bo Du

arXiv:2506.22283 · cs.CV · March 3, 2026

TL;DR

This paper introduces VisionDrop, a training-free visual token pruning method for LVLMs. It selects informative visual tokens using visual attention alone, which improves efficiency and avoids the cross-modal misalignment that undermines text-guided pruning.

Contribution

The paper proposes a training-free, visual-only pruning framework that reduces the number of visual tokens in LVLMs without relying on textual signals or additional training.

Findings

01. VisionDrop achieves a 2.7x reduction in inference latency.

02. It reduces FLOPs by 6x while maintaining 95.71% of original performance.

03. The method outperforms existing token reduction approaches across benchmarks.

Abstract

Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial computational overhead and limiting the scalability of LVLMs in practice. Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs). However, most in-LLM reduction approaches rely on text-conditioned interactions, implicitly assuming that textual tokens can reliably capture the importance of visual tokens. In this work, we revisit this assumption and reveal causal, semantic, and spatial forms of cross-modal misalignment. These misalignments undermine the effectiveness of text-guided visual token reduction. To address this, we introduce VisionDrop, a training-free, visual-only pruning framework that…
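As a rough illustration of what visual-only selection can look like, the sketch below scores each visual token by the attention it receives from other visual tokens and keeps the top-scoring ones. No text tokens are involved. The abstract is truncated here, so the scoring rule (column mean of a visual-to-visual attention matrix), the function name `visual_only_prune`, and the `keep` parameter are all assumptions made for illustration, not the paper's actual procedure.

```python
import numpy as np


def visual_only_prune(attn, keep):
    """Return sorted indices of the `keep` visual tokens with the highest score.

    Hypothetical helper, not the paper's implementation.
    attn: (N, N) visual-to-visual attention matrix, row i = query token i.
    keep: number of visual tokens to retain (1 <= keep <= N).
    """
    attn = np.asarray(attn, dtype=float)
    if attn.ndim != 2 or attn.shape[0] != attn.shape[1]:
        raise ValueError("attn must be a square 2-D matrix")
    n = attn.shape[0]
    if not 0 < keep <= n:
        raise ValueError("keep must be between 1 and the number of tokens")

    # Assumed scoring rule: average attention each token receives
    # from all visual queries (column mean). Text queries are never used.
    scores = attn.mean(axis=0)

    # Stable descending sort, then restore the original spatial order.
    top = np.argsort(-scores, kind="stable")[:keep]
    return np.sort(top)


# Toy example: 4 visual tokens, keep the 2 that receive the most attention.
toy_attn = np.array([
    [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.3, 0.5, 0.1],
    [0.2, 0.1, 0.6, 0.1],
    [0.1, 0.3, 0.5, 0.1],
])
kept = visual_only_prune(toy_attn, keep=2)
```

Returning the indices in their original order keeps the retained tokens spatially ordered, which matters if positional information is applied downstream. A text-guided variant would instead score visual tokens by attention from text queries, which is the assumption the abstract calls into question.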

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning