Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models
Kexin Ma, Jing Xiao, Chaofeng Chen, Geyong Min, Guibo Zhu, Jinqiao Wang, Liang Liao

TL;DR
DeSAP introduces a decoupled similarity approach for task-aware token pruning in large vision-language models, enhancing efficiency while maintaining high accuracy by leveraging cross-modal relevance and visual saliency.
Contribution
The paper proposes a novel decoupled similarity method for precise, task-aware token pruning that outperforms existing techniques in accuracy and efficiency.
Findings
DeSAP achieves a 10× FLOPs reduction on LLaVA-1.5-7B.
DeSAP retains 98.1% of the original performance while keeping only 11.1% of the visual tokens.
DeSAP outperforms state-of-the-art methods across multiple benchmarks.
Abstract
Token pruning has emerged as an effective approach to reduce the substantial computational overhead of Large Vision-Language Models (LVLMs) by discarding less informative visual tokens while preserving performance. However, existing methods typically rely on individual attention sources from different LVLM components, resulting in incomplete and suboptimal pruning decisions due to biased attention distributions. To address this problem, we propose DeSAP, a novel Decoupled Similarity-Aware Pruning method for precise, task-aware token pruning within the visual encoder. Specifically, DeSAP introduces a decoupled similarity to capture fine-grained cross-modal relevance between visual features and text tokens, providing explicit task-related guidance for pruning. By integrating decoupled similarity with visual saliency signals derived from visual attention, DeSAP performs token pruning under…
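To make the idea of combining a text-relevance signal with a visual-saliency signal concrete, here is a minimal sketch of score-based token pruning. It is not the authors' implementation: the abstract does not define the decoupled similarity, so plain cosine similarity to the best-matching text token stands in for it, and the min-max normalization, the mixing weight `alpha`, the function name `prune_visual_tokens`, and the single-pass selection are all assumptions made for illustration. The 576-token input is likewise an assumption about the 24×24 patch grid used by LLaVA-1.5; with the 11.1% keep ratio reported above, that leaves 64 tokens.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def minmax(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def prune_visual_tokens(visual, text, saliency, keep_ratio=0.111, alpha=0.5):
    """Return the indices of the visual tokens to keep, in original order.

    visual:   list of N feature vectors (one per visual token)
    text:     list of T feature vectors (one per text token)
    saliency: list of N saliency scores (e.g. attention received by each token)
    alpha:    hypothetical weight between text relevance and saliency
    """
    # Stand-in for the paper's decoupled similarity: best cosine match to any text token.
    relevance = [max(cosine(v, t) for t in text) for v in visual]
    rel, sal = minmax(relevance), minmax(saliency)
    score = [alpha * r + (1 - alpha) * s for r, s in zip(rel, sal)]

    # Keep the top-k tokens by combined score, then restore positional order.
    k = max(1, round(keep_ratio * len(score)))
    ranked = sorted(range(len(score)), key=lambda i: score[i], reverse=True)
    return sorted(ranked[:k])


# Usage with random placeholder features (assumed shapes, not real model outputs).
random.seed(0)
visual = [[random.gauss(0, 1) for _ in range(16)] for _ in range(576)]
text = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
saliency = [random.random() for _ in range(576)]
kept = prune_visual_tokens(visual, text, saliency)
print(len(kept))  # 64
```

Returning the kept indices in their original order preserves the spatial ordering of the surviving tokens, which matters if positional information is carried by sequence position.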
