ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference
Surendra Pathak, Bo Han

TL;DR
ASAP is a training-free, attention-shift-aware pruning method for LVLMs that selects and merges visual tokens, reducing FLOPs by approximately 80% while retaining 99.02% of original performance.
Contribution
The paper presents a novel pruning approach that addresses attention shift and token redundancy without retraining, improving inference efficiency for LVLMs.
Findings
Retains 99.02% of original performance
Reduces FLOPs by approximately 80%
Addresses attention shift and token redundancy effectively
Abstract
While Large Vision-Language Models (LVLMs) demonstrate exceptional multi-modal capabilities, the quadratic computational cost of processing high-resolution visual tokens remains a critical bottleneck. Though recent token reduction strategies attempt to accelerate inference, such methods inadequately exploit attention values and fail to address token redundancy. More critically, they overlook the "attention shift" phenomenon inherent in LVLMs, which skews token attention scores. In this work, we propose ASAP, a novel training-free, KV-Cache-compatible pruning recipe that comprehensively addresses these limitations. First, we mitigate the attention shift by utilizing a dynamic bidirectional soft attention mask, ensuring the selection of genuinely informative tokens rather than naive attention-based selection. Second, we posit that high semantic redundancy within the token set degrades…
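The abstract is cut off before it describes how ASAP handles redundancy, so the following is only a generic illustration of the two ideas it names: selecting visual tokens by an attention-derived score and folding the discarded, redundant tokens into similar retained ones. It is not ASAP's algorithm. The function names `cosine` and `prune_and_merge`, the averaging merge rule, and the assumption that per-token scores are already available (the paper's dynamic bidirectional soft attention mask is not modeled) are all hypothetical choices made for this sketch.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_and_merge(tokens, scores, keep):
    """Hypothetical sketch: keep the `keep` highest-scoring tokens and
    average every dropped token into its most similar kept token.

    tokens: list of feature vectors (lists of floats), one per visual token
    scores: one importance score per token (assumed to be given)
    Returns (kept_indices, merged_tokens), both in original token order.
    """
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:keep])   # restore original positional order
    dropped = order[keep:]

    sums = {i: list(tokens[i]) for i in kept}
    counts = {i: 1 for i in kept}
    for j in dropped:
        # Assign the dropped token to the kept token it most resembles.
        target = max(kept, key=lambda i: cosine(tokens[i], tokens[j]))
        sums[target] = [s + x for s, x in zip(sums[target], tokens[j])]
        counts[target] += 1

    merged = [[s / counts[i] for s in sums[i]] for i in kept]
    return kept, merged


# Four toy 2-D "visual tokens": two near the x-axis, two near the y-axis.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
scores = [0.4, 0.1, 0.3, 0.2]
kept, merged = prune_and_merge(tokens, scores, keep=2)
print(kept, merged)
```

In this toy input, tokens 0 and 2 have the highest scores and are retained, and each of the two dropped tokens is averaged into the retained token pointing in nearly the same direction. Halving the token count this way shrinks the sequence the language model attends over, which is the source of the compute savings such methods target.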
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
