TL;DR
HiPrune is a hierarchical, attention-based token pruning method for vision-language models. It uses attention patterns intrinsic to the vision encoder to decide which visual tokens to keep, reducing inference FLOPs by 58.7% while retaining up to 99.3% task accuracy with one third of the tokens.
Contribution
The paper proposes a training-free, model-agnostic token pruning method that exploits hierarchical attention patterns in the vision encoder, and introduces HiPrune++, a prompt-aware variant for improved instruction following at low token budgets.
Findings
Achieves up to 99.3% task accuracy with only 1/3 of the visual tokens.
Reduces inference FLOPs by 58.7%.
Maintains up to 99.7% accuracy with 2/9 of the visual tokens, showing robustness at low token budgets.
Abstract
Vision-Language Models (VLMs) encode images and videos into abundant tokens, which carry substantial redundancy and incur high computation cost. While visual token pruning mitigates the issue, most existing methods lack insight into the intrinsic properties of the vision encoder itself. In this work, we dive into the vision encoder and show, qualitatively and quantitatively, that the middle layers pay more attention to the main objects of the image, while the deep layers attend to tokens with rich global information. Utilizing this Hierarchical attention pattern, we propose HiPrune, a training-free and model-agnostic token Pruning method. HiPrune identifies three types of visual tokens according to their attention in different phases of the vision encoder, each preserving a different level of information. By coupling with the similarity of text tokens, we propose a prompt-aware variant, HiPrune++, which…
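The abstract's core idea, keeping tokens that receive high attention in the middle layers (main objects) together with tokens that receive high attention in the deep layers (global information), can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not spell out the three token types or the exact selection rule, so the function names, the `mid_fraction` split, and the toy attention scores below are all hypothetical and chosen only to show budgeted selection from two attention rankings.

```python
from typing import List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest scores; ties go to the lower index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def select_tokens(
    mid_attn: Sequence[float],
    deep_attn: Sequence[float],
    budget: int,
    mid_fraction: float = 0.5,  # hypothetical split, not taken from the paper
) -> List[int]:
    """Pick `budget` token indices from two per-token attention scores.

    mid_attn:  attention each token receives in a middle encoder layer
    deep_attn: attention each token receives in a deep encoder layer
    """
    if len(mid_attn) != len(deep_attn):
        raise ValueError("both score lists must cover the same tokens")
    if not 0 <= budget <= len(mid_attn):
        raise ValueError("budget must be between 0 and the token count")

    # Step 1: reserve part of the budget for object-centric tokens.
    n_mid = int(budget * mid_fraction)
    kept = set(top_k_indices(mid_attn, n_mid))

    # Step 2: fill the rest with the highest-ranked deep-layer tokens
    # that were not already selected in step 1.
    for idx in top_k_indices(deep_attn, len(deep_attn)):
        if len(kept) >= budget:
            break
        kept.add(idx)

    # Sorted order keeps the original spatial ordering of the tokens.
    return sorted(kept)


if __name__ == "__main__":
    # Toy 3x3 patch grid: the centre patch dominates the middle layer,
    # two corner patches dominate the deep layer.
    mid = [0.05, 0.10, 0.05, 0.10, 0.90, 0.60, 0.05, 0.10, 0.05]
    deep = [0.80, 0.02, 0.02, 0.02, 0.30, 0.02, 0.02, 0.02, 0.70]
    print(select_tokens(mid, deep, budget=3))  # 1/3 of 9 tokens
    print(select_tokens(mid, deep, budget=2))  # 2/9 of 9 tokens
```

Because the selection only reads attention scores that the encoder already computes, a scheme of this shape needs no retraining, which is consistent with the training-free claim in the abstract. The two budgets in the usage example mirror the 1/3 and 2/9 token ratios reported in the findings.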
