Attention Debiasing for Token Pruning in Vision Language Models
Kai Zhao, Wubang Yuan, Yuchen Lin, Liting Ruan, Xiaofeng Lu, Deng-Ping Fan, Ming-Ming Cheng, Dan Zeng

TL;DR
This paper identifies two biases in the attention of vision-language models, a recency bias and an attention-sink effect, that distort attention-based token pruning, and introduces debiasing techniques to improve pruning accuracy and model efficiency across various benchmarks.
Contribution
The authors propose two lightweight debiasing methods that correct attention biases in VLMs, enhancing token pruning effectiveness without model or task restrictions.
Findings
The debiased pruning criterion yields significant performance improvements across ten benchmarks.
Removing the attention biases brings attention scores closer to the semantic relevance of each token.
The methods are model-agnostic and easy to integrate.
Abstract
Vision-language models (VLMs) typically encode substantially more visual tokens than text tokens, resulting in significant token redundancy. Pruning uninformative visual tokens is therefore crucial for improving computational efficiency, and language-to-vision attention has become a widely used importance criterion for this purpose. However, we find that attention in VLMs is systematically biased. It disproportionately favors tokens appearing later in the sequence, manifesting as over-attention to lower image regions, and assigns inflated scores to semantically empty padding tokens. These behaviors stem from intrinsic recency bias and attention sink effects inherited from large language models (LLMs), and they distort attention-based pruning by preserving irrelevant visual content. To derive a pruning criterion better aligned with semantic relevance, we introduce two lightweight yet…
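To make the pruning setting concrete, here is a minimal sketch of attention-based visual token pruning with two illustrative corrections. It is not the authors' method, which the truncated abstract does not specify. The function name `prune_by_attention`, the `pad_mask` and `position_trend` arguments, and the choice to divide by a position trend and to mask padding are all assumptions made for illustration.

```python
import numpy as np

def prune_by_attention(attn, keep, pad_mask=None, position_trend=None):
    """Return the sorted indices of the `keep` highest-scoring visual tokens.

    attn:           (num_text, num_visual) language-to-vision attention weights.
    pad_mask:       optional boolean (num_visual,), True for padding tokens
                    (hypothetical input; assumed known from preprocessing).
    position_trend: optional positive (num_visual,) estimate of how much
                    attention each position receives regardless of content
                    (hypothetical input; e.g. an average over many images).
    """
    # Baseline criterion: mean attention each visual token gets from text tokens.
    scores = attn.mean(axis=0).astype(float)

    # Illustrative recency correction: divide out the position-only trend so
    # late positions are not favored merely for being late in the sequence.
    if position_trend is not None:
        scores = scores / (np.asarray(position_trend, dtype=float) + 1e-8)

    # Illustrative padding correction: padding tokens can never be kept.
    if pad_mask is not None:
        scores = np.where(np.asarray(pad_mask, dtype=bool), -np.inf, scores)

    keep = min(keep, scores.size)
    top = np.argsort(-scores, kind="stable")[:keep]
    return np.sort(top)


# Toy example: true relevance is highest at positions 0 and 2, but a
# position trend that grows along the sequence inflates later tokens.
relevance = np.array([3.0, 1.0, 2.0, 1.0, 1.0, 1.0])
trend = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
attn = (relevance * trend)[None, :]  # a single text token attending to 6 visual tokens

raw_kept = prune_by_attention(attn, keep=2)
debiased_kept = prune_by_attention(attn, keep=2, position_trend=trend)
```

In this toy setup the raw criterion keeps a late token that is only weakly relevant, while the trend-corrected criterion keeps the two tokens with the highest underlying relevance. How such a trend should be estimated in practice is left open here.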
