Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding

Fatih Ilhan; Gaowen Liu; Ramana Rao Kompella; Selim Furkan Tekin; Tiansheng Huang; Zachary Yahn; Yichang Xu; Ling Liu

arXiv:2603.23914·cs.CV·March 26, 2026

TL;DR

AttentionPack is a framework that improves memory efficiency and inference speed in large vision-language models through attention-aware compression and decompression, enabling longer contexts and faster processing.

Contribution

The paper introduces AttentionPack, an attention-aware optimization framework that reduces memory overhead and latency in large VLMs during decoding, especially for long-context multi-modal tasks.

Findings

01. Memory efficiency improved by up to 8x

02. Faster inference with higher batch sizes

03. Maintains output quality and longer context capabilities

Abstract

Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference-time efficiency remains a significant challenge because of the memory overhead during decoding, especially when the query and answer consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive, attention-aware optimization framework for large vision-language models that improves memory efficiency during decoding. It addresses the challenges posed by the large number of visual inputs and interactions, particularly in long-context tasks with multiple high-resolution images or videos. AttentionPack is novel in two aspects: (i) we introduce a multi-head attention compaction method that stores key and value matrices economically by exploiting their implicit low-rank structure, and (ii) we develop a…
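To make the low-rank storage idea in aspect (i) concrete, here is a minimal sketch of the arithmetic behind it. This is not the paper's algorithm: it only assumes that a per-head key or value matrix of shape `n_tokens x head_dim` can be approximated by two factors of rank `rank`, and all function names and the example sizes (4096 tokens, head dimension 128, rank 16) are hypothetical choices for illustration.

```python
from typing import List

Matrix = List[List[float]]


def dense_kv_elements(n_tokens: int, head_dim: int) -> int:
    """Elements needed to store one dense key (or value) matrix for one head."""
    return n_tokens * head_dim


def low_rank_kv_elements(n_tokens: int, head_dim: int, rank: int) -> int:
    """Elements needed for factors A (n_tokens x rank) and B (rank x head_dim)."""
    return rank * (n_tokens + head_dim)


def compression_ratio(n_tokens: int, head_dim: int, rank: int) -> float:
    """Dense storage divided by factored storage (higher means more savings)."""
    dense = dense_kv_elements(n_tokens, head_dim)
    factored = low_rank_kv_elements(n_tokens, head_dim, rank)
    return dense / factored


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Rebuild the dense matrix as A @ B (the decompression step in this sketch)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to illustrate the arithmetic.
    n_tokens, head_dim, rank = 4096, 128, 16
    print("dense elements:   ", dense_kv_elements(n_tokens, head_dim))
    print("factored elements:", low_rank_kv_elements(n_tokens, head_dim, rank))
    print("ratio:            ", compression_ratio(n_tokens, head_dim, rank))
```

The factored form only saves memory when `rank` is well below `head_dim`, and the factors must be multiplied back together (or folded into the attention computation) at decode time, so any saving trades memory for extra compute. How AttentionPack chooses the rank and performs decompression is not stated in the truncated abstract above.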

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications