ConsensusDrop: Fusing Visual and Cross-Modal Saliency for Efficient Vision Language Models

Dhruv Parikh; Haoyang Fan; Rajgopal Kannan; Viktor Prasanna

arXiv:2602.00946·cs.CV·February 3, 2026

Open Access

TL;DR

ConsensusDrop is a training-free method that fuses vision-encoder saliency with query-aware cross-modal attention to prune redundant visual tokens in vision-language models, lowering inference cost while improving on pruning that uses either signal alone.

Contribution

It introduces a novel consensus-based token pruning framework that combines vision encoder saliency with cross-attention signals without additional training.

Findings

01. Outperforms prior pruning methods at the same token budget.

02. Maintains near-baseline accuracy even with aggressive token reduction.

03. Significantly reduces time to first token (TTFT) and KV cache footprint.
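
To give a sense of why dropping visual tokens shrinks the KV cache, here is a back-of-the-envelope sketch. It assumes standard multi-head attention, where each layer stores one key and one value vector of size hidden_size per token. The function name and every configuration value (32 layers, hidden size 4096, fp16, 576 visual tokens pruned to 64) are hypothetical choices for illustration and are not taken from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, hidden_size, bytes_per_value=2):
    """Approximate KV cache size for `num_tokens` tokens.

    Assumes plain multi-head attention: every layer caches one key and one
    value vector of length `hidden_size` per token (hence the factor 2).
    """
    return 2 * num_layers * num_tokens * hidden_size * bytes_per_value


# Hypothetical configuration, for illustration only (not from the paper).
LAYERS, HIDDEN = 32, 4096

full = kv_cache_bytes(576, LAYERS, HIDDEN)   # all visual tokens kept
pruned = kv_cache_bytes(64, LAYERS, HIDDEN)  # after pruning to 64 tokens

print(full // 2**20, "MiB ->", pruned // 2**20, "MiB")  # 288 MiB -> 32 MiB
```

Under these assumptions the cache for visual tokens scales linearly with the number of tokens kept, so a 9x reduction in tokens gives a 9x smaller cache for that portion of the sequence.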

Abstract

Vision-Language Models (VLMs) are expensive because the LLM processes hundreds of largely redundant visual tokens. Existing token reduction methods typically exploit either vision-encoder saliency (broad but query-agnostic) or LLM cross-attention (query-aware but sparse and costly). We show that neither signal alone is sufficient: fusing them consistently improves performance compared to unimodal visual token selection (ranking). However, making such fusion practical is non-trivial: cross-modal saliency is usually only available inside the LLM (too late for efficient pre-LLM pruning), and the two signals are inherently asymmetric, so naive fusion underutilizes their complementary strengths. We propose ConsensusDrop, a training-free framework that derives a consensus ranking by reconciling vision encoder saliency with query-aware cross-attention,…
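
The general idea of fusing two per-token saliency signals into a single ranking can be sketched as follows. This is not the ConsensusDrop algorithm, whose details are not given in the text above; it is a simple weighted sum of min-max normalized scores, which is closer to the "naive fusion" the abstract mentions. The function names, the `alpha` weight, and the example scores are all assumptions made for illustration.

```python
def minmax(scores):
    """Rescale scores to [0, 1]; a constant vector maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_and_select(visual, cross, keep, alpha=0.5):
    """Return indices of the `keep` tokens with the highest fused score.

    visual: per-token saliency from the vision encoder (query-agnostic)
    cross:  per-token query-aware cross-attention score
    alpha:  hypothetical weight on the visual signal (1 - alpha on cross)
    """
    if len(visual) != len(cross):
        raise ValueError("score lists must have the same length")
    v, c = minmax(visual), minmax(cross)
    fused = [alpha * a + (1 - alpha) * b for a, b in zip(v, c)]
    ranked = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    # Sort the kept indices so the surviving tokens keep their original order.
    return sorted(ranked[:keep])


# Made-up scores for five visual tokens.
visual = [0.9, 0.8, 0.1, 0.5, 0.2]
cross = [0.0, 0.1, 0.9, 1.0, 0.0]

print(fuse_and_select(visual, cross, keep=2, alpha=1.0))  # visual only: [0, 1]
print(fuse_and_select(visual, cross, keep=2, alpha=0.0))  # cross only:  [2, 3]
print(fuse_and_select(visual, cross, keep=2))             # fused:       [0, 3]
```

In this toy case the fused selection keeps one token favored by each signal, which neither unimodal ranking would choose on its own. A fixed `alpha` treats the two signals symmetrically, which is exactly the limitation the abstract attributes to naive fusion.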

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning