Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models
Jie Ma, Yihang Liu, Zhike Qiu, Jiayi Ji, Xiaoshuai Sun

TL;DR
This paper introduces COAST, an adaptive semantic token pruning method for vision-language models that preserves essential visual information: it removes 77.8% of visual tokens for a 2.15x speedup while retaining 98.64% of original performance.
Contribution
COAST offers a training-free, adaptive semantic routing approach that outperforms existing pruning methods in vision-language reasoning tasks.
Findings
Reduces visual tokens by 77.8% with 2.15x speedup
Retains 98.64% of original performance across benchmarks
Outperforms strong pruning baselines across multiple LVLMs
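To make the 77.8% figure concrete, the short calculation below converts a pruning ratio into a retained-token count. The 576-token input is an assumption chosen for illustration (a 24x24 patch grid); the summary does not state which token counts the paper evaluates, and `tokens_kept` is a hypothetical helper name.

```python
def tokens_kept(total_tokens: int, prune_ratio: float) -> int:
    """Number of visual tokens that survive pruning at the given ratio."""
    return round(total_tokens * (1.0 - prune_ratio))


# Assumption: 576 visual tokens per image (24 x 24 patches), for illustration only.
kept = tokens_kept(576, 0.778)
print(kept)
```

Under that assumption, roughly one visual token in every four and a half is passed on to the language model.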
Abstract
Are low-attention visual tokens truly redundant in vision-language reasoning? Existing pruning methods often assume so, ranking visual tokens by shallow text-to-image attention and discarding low-scoring patches to accelerate LVLM inference. We show that this scalar criterion is unreliable for compositional reasoning: tokens ignored in early layers can later become essential for resolving secondary objects, spatial relations, and contextual cues. Premature pruning can therefore induce Visual Aphasia, a failure mode in which the model loses visual grounding and falls back on language priors. We introduce COAST (COntrastive Adaptive Semantic Token Pruning), a training-free pruning framework that casts compression as adaptive semantic routing. COAST uses native cross-modal attention to identify query-specific anchors and estimate contextual dispersion via attention entropy, then adapts the…
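The idea of estimating contextual dispersion from attention entropy can be illustrated with a minimal sketch. This is not the COAST algorithm: the abstract is cut off before it says what is adapted, so the choice to scale a token budget with normalized entropy is an assumption made here, as are the function names and the `min_keep` / `max_keep` parameters. The sketch also keeps tokens purely by attention rank, which is the simple scalar criterion the abstract argues is insufficient on its own.

```python
import math


def attention_entropy(attn):
    """Shannon entropy (in nats) of text-to-image attention over visual tokens."""
    total = sum(attn)
    probs = [a / total for a in attn]
    return -sum(p * math.log(p) for p in probs if p > 0)


def adaptive_prune(attn, min_keep, max_keep):
    """Pick a token budget from attention dispersion, then keep the top tokens.

    Assumption (not from the paper): the budget grows linearly with
    normalized entropy, from min_keep (peaked) to max_keep (uniform).
    """
    n = len(attn)
    # Normalized entropy in [0, 1]: 0 = all mass on one token, 1 = uniform.
    dispersion = attention_entropy(attn) / math.log(n) if n > 1 else 0.0
    budget = round(min_keep + dispersion * (max_keep - min_keep))
    budget = max(1, min(n, budget))
    # Highest-attention tokens stand in for the query-specific anchors.
    ranked = sorted(range(n), key=lambda i: attn[i], reverse=True)
    return sorted(ranked[:budget]), dispersion


# Concentrated attention: one dominant patch, so few tokens are kept.
peaked_keep, peaked_disp = adaptive_prune([0.97, 0.01, 0.01, 0.01], 1, 4)
# Dispersed attention: the query depends on broad context, so more are kept.
uniform_keep, uniform_disp = adaptive_prune([0.25, 0.25, 0.25, 0.25], 1, 4)
```

The contrast between the two calls shows the intended behavior: a query whose attention is spread across the image receives a larger budget than one focused on a single region.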
