Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models

Jie Ma; Yihang Liu; Zhike Qiu; Jiayi Ji; Xiaoshuai Sun

arXiv:2605.09429·cs.CV·May 12, 2026

TL;DR

This paper introduces COAST, a training-free adaptive semantic token pruning method for vision-language models. It preserves essential visual information while cutting visual tokens by 77.8%, retaining 98.64% of original performance with a 2.15x speedup.

Contribution

COAST is a training-free pruning framework that casts compression as adaptive semantic routing and outperforms strong pruning baselines on vision-language reasoning tasks across multiple LVLMs.

Findings

01. Reduces visual tokens by 77.8% with a 2.15x speedup

02. Retains 98.64% of original performance across benchmarks

03. Outperforms strong pruning baselines across multiple LVLMs
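
To put the headline numbers in concrete terms, the short sketch below converts a 77.8% token reduction and a 2.15x speedup into a token count and a latency. The helper names, the 576-token input (a 24x24 patch grid), and the 100 ms baseline are illustrative assumptions and are not taken from the paper. The listing also does not say how the speedup is measured, so treating it as a simple latency ratio is an assumption as well.

```python
def tokens_kept(total_tokens: int, reduction: float = 0.778) -> int:
    """Visual tokens remaining after pruning a fraction `reduction` of them."""
    return round(total_tokens * (1.0 - reduction))


def latency_after(baseline_ms: float, speedup: float = 2.15) -> float:
    """Latency after pruning, ASSUMING speedup = baseline latency / pruned latency."""
    return baseline_ms / speedup


if __name__ == "__main__":
    # 576 tokens and 100 ms are hypothetical inputs chosen for illustration.
    print(tokens_kept(576))                 # 128
    print(round(latency_after(100.0), 1))   # 46.5
```

Under these assumptions, roughly one visual token in 4.5 survives pruning, and a 100 ms step would drop to about 46.5 ms.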

Abstract

Are low-attention visual tokens truly redundant in vision-language reasoning? Existing pruning methods often assume so, ranking visual tokens by shallow text-to-image attention and discarding low-scoring patches to accelerate LVLM inference. We show that this scalar criterion is unreliable for compositional reasoning: tokens ignored in early layers can later become essential for resolving secondary objects, spatial relations, and contextual cues. Premature pruning can therefore induce Visual Aphasia, a failure mode in which the model loses visual grounding and falls back on language priors. We introduce COAST (COntrastive Adaptive Semantic Token Pruning), a training-free pruning framework that casts compression as adaptive semantic routing. COAST uses native cross-modal attention to identify query-specific anchors and estimate contextual dispersion via attention entropy, then adapts the…
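
The two quantities the abstract names, query-specific anchors and attention entropy, can be illustrated with a minimal sketch. It assumes a single 1-D vector of text-to-image attention weights, one weight per visual token, already averaged over heads and text positions. The function names, the top-k anchor rule, and the normalization of entropy to [0, 1] are assumptions made for illustration, not the authors' implementation. How COAST adapts pruning from these quantities is cut off in the abstract above and is not reconstructed here.

```python
import numpy as np


def attention_entropy(attn) -> float:
    """Shannon entropy of an attention distribution, normalized to [0, 1].

    Near 0: attention is concentrated on a few visual tokens.
    Near 1: attention is dispersed across many visual tokens.
    """
    p = np.asarray(attn, dtype=np.float64)
    if p.size < 2:
        return 0.0
    p = p / p.sum()                  # renormalize defensively
    nz = p[p > 0]                    # skip zero entries (0 * log 0 := 0)
    h = -np.sum(nz * np.log(nz))
    return float(h / np.log(p.size))


def top_anchors(attn, k: int) -> np.ndarray:
    """Indices of the k most-attended visual tokens, in positional order.

    Top-k selection is an illustrative stand-in for anchor identification.
    """
    scores = np.asarray(attn, dtype=np.float64)
    idx = np.argsort(scores)[::-1][:k]
    return np.sort(idx)


if __name__ == "__main__":
    attn = [0.10, 0.50, 0.05, 0.35]  # hypothetical weights for 4 visual tokens
    print(top_anchors(attn, 2))      # [1 3]
    print(attention_entropy(attn))
```

A low entropy suggests the query depends on a few patches, while a high entropy suggests the relevant context is spread across the image. That distinction is what the abstract calls contextual dispersion.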
