TL;DR
HAWK is a training-free, head importance-aware visual token pruning method for multimodal large language models (MLLMs) that preserves most of the original accuracy while reducing inference latency and GPU memory usage.
Contribution
HAWK introduces a head importance-aware pruning approach that combines per-head importance weights with text-guided attention to decide which visual tokens to keep in MLLMs.
Findings
Retains 96.0% of the original accuracy after pruning 80.2% of visual tokens.
Reduces end-to-end latency to 74.4% of the original.
Decreases GPU memory usage across models.
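To put the reported ratios in concrete terms, the short sketch below converts them into a remaining token count and an equivalent speedup factor. The input size of 1000 visual tokens is a hypothetical figure chosen for illustration; it does not come from the paper, and the helper names are made up for this example.

```python
def kept_tokens(total_tokens: int, prune_ratio: float) -> int:
    """Number of visual tokens left after pruning the given fraction."""
    return round(total_tokens * (1.0 - prune_ratio))


def speedup(latency_fraction: float) -> float:
    """Speedup factor implied by running at a fraction of the original latency."""
    return 1.0 / latency_fraction


# Hypothetical input size (assumption, not reported in the source).
total_visual_tokens = 1000

remaining = kept_tokens(total_visual_tokens, 0.802)  # 80.2% pruned
factor = speedup(0.744)                              # latency at 74.4% of original

print(f"tokens kept: {remaining} of {total_visual_tokens}")
print(f"implied end-to-end speedup: {factor:.2f}x")
```

Under these assumptions, roughly one in five visual tokens survives pruning, and a latency of 74.4% of the original corresponds to a speedup of about 1.34x.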
Abstract
In multimodal large language models (MLLMs), the surge of visual tokens significantly increases the inference time and computational overhead, making them impractical for real-time or resource-constrained applications. Visual token pruning is a promising strategy for reducing the cost of MLLM inference by removing redundant visual tokens. Existing research usually assumes that all attention heads contribute equally to the visual interpretation. However, our study reveals that different heads may capture distinct visual semantics and inherently play distinct roles in visual processing. In light of this observation, we propose HAWK, a head importance-aware visual token pruning method that perceives the varying importance of attention heads in visual tasks to maximize the retention of crucial tokens. By leveraging head importance weights and text-guided attention to assess visual token…
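The general idea in the abstract, scoring visual tokens by text-guided attention weighted by head importance, can be illustrated with a minimal sketch. This is not the authors' implementation: the tensor layout `attn[head][text_token][visual_token]`, the averaging over text tokens, the top-k selection rule, and all function names are assumptions made for this example, and the head importance weights are taken as given rather than computed.

```python
from typing import List

Attention = List[List[List[float]]]  # attn[head][text_token][visual_token] (assumed layout)


def score_visual_tokens(attn: Attention, head_weights: List[float]) -> List[float]:
    """Weighted sum over heads of the mean text-to-visual attention per visual token."""
    num_visual = len(attn[0][0])
    scores = [0.0] * num_visual
    for head, weight in zip(attn, head_weights):
        num_text = len(head)
        for row in head:  # one row per text token
            for v, a in enumerate(row):
                scores[v] += weight * a / num_text
    return scores


def prune(scores: List[float], keep_ratio: float) -> List[int]:
    """Return the indices of the highest-scoring visual tokens, in original order."""
    k = max(1, round(len(scores) * keep_ratio))
    top = sorted(range(len(scores)), key=lambda v: scores[v], reverse=True)[:k]
    return sorted(top)


# Toy input: 2 heads, 2 text tokens, 4 visual tokens (values are made up).
attn = [
    [[0.7, 0.1, 0.1, 0.1], [0.6, 0.2, 0.1, 0.1]],  # head 0 focuses on visual token 0
    [[0.1, 0.1, 0.1, 0.7], [0.1, 0.1, 0.2, 0.6]],  # head 1 focuses on visual token 3
]

kept_if_head0_matters = prune(score_visual_tokens(attn, [0.9, 0.1]), keep_ratio=0.25)
kept_if_head1_matters = prune(score_visual_tokens(attn, [0.1, 0.9]), keep_ratio=0.25)
print(kept_if_head0_matters, kept_if_head1_matters)
```

In this toy setup the two heads attend to different visual tokens, so changing the head weights changes which token survives. Treating all heads equally would leave the two candidates tied, which is the situation the head importance weighting is meant to resolve.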
