ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference

Surendra Pathak; Bo Han

arXiv:2603.14549 · cs.CV · March 19, 2026

TL;DR

ASAP is a training-free, attention-shift-aware pruning method for LVLMs. By selecting informative visual tokens and merging redundant ones, it cuts FLOPs by roughly 80% while retaining 99.02% of the original model's performance.

Contribution

The paper presents a novel pruning approach that addresses attention shift and token redundancy without retraining, improving inference efficiency for LVLMs.

Findings

01. Retains 99.02% of original performance

02. Reduces FLOPs by approximately 80%

03. Addresses attention shift and token redundancy effectively

Abstract

While Large Vision-Language Models (LVLMs) demonstrate exceptional multi-modal capabilities, the quadratic computational cost of processing high-resolution visual tokens remains a critical bottleneck. Though recent token reduction strategies attempt to accelerate inference, such methods inadequately exploit attention values and fail to address token redundancy. More critically, they overlook the "attention shift" phenomenon inherent in LVLMs, which skews token attention scores. In this work, we propose ASAP, a novel training-free, KV-Cache-compatible pruning recipe that comprehensively addresses these limitations. First, we mitigate the attention shift by utilizing a dynamic bidirectional soft attention mask, ensuring the selection of genuinely informative tokens rather than naive attention-based selection. Second, we posit that high semantic redundancy within the token set degrades…
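The abstract is cut off before the pruning recipe is fully described, so the following is only a generic illustration of the two ideas it names: keeping the highest-scoring visual tokens, then folding redundant dropped tokens into similar kept ones. It is not the ASAP algorithm. The function names, the cosine-similarity threshold, and the averaging merge rule are all assumptions made for illustration, and `scores` stands in for whatever importance signal a real method would derive (ASAP's soft attention mask is not reproduced here).

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prune_and_merge(tokens, scores, keep, sim_threshold=0.9):
    """Hypothetical helper: keep the `keep` highest-scoring tokens, then average
    each dropped token into its most similar kept token if they are close enough.

    tokens: list of feature vectors (lists of floats), one per visual token
    scores: list of importance scores, one per token (assumed to be given)
    Returns (merged_tokens, kept_indices), with kept_indices in original order.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Rank token indices by score, highest first; sorted() is stable, so ties
    # keep their original order.
    order = sorted(range(len(tokens)), key=lambda i: -scores[i])
    kept_idx = sorted(order[:keep])
    sums = [list(tokens[i]) for i in kept_idx]
    counts = [1] * len(kept_idx)

    for d in order[keep:]:
        sims = [cosine(tokens[d], tokens[k]) for k in kept_idx]
        j = max(range(len(sims)), key=sims.__getitem__)
        # Assumed rule: merge only near-duplicates; discard everything else.
        if sims[j] >= sim_threshold:
            sums[j] = [s + x for s, x in zip(sums[j], tokens[d])]
            counts[j] += 1

    merged = [[s / c for s in vec] for vec, c in zip(sums, counts)]
    return merged, kept_idx


# Toy usage: four 2-D "tokens", of which the last two nearly duplicate the first two.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.1, 1.0]]
toy_scores = [0.9, 0.8, 0.1, 0.2]
merged_tokens, kept_indices = prune_and_merge(toy_tokens, toy_scores, keep=2)
```

Halving the token count in this toy case shortens the sequence that later layers attend over, which is where the savings in any such scheme come from. The 80% FLOPs figure reported above depends on the paper's actual pruning schedule and cannot be derived from this sketch.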

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications