AKVQ-VL: Attention-Aware KV Cache Adaptive 2-Bit Quantization for Vision-Language Models

Zunhai Su; Wang Shen; Linge Li; Zhe Chen; Hanyu Wei; Huangqi Yu; Kehong Yuan

arXiv:2501.15021·cs.CL·January 28, 2025

Open Access

TL;DR

AKVQ-VL introduces an attention-aware, adaptive 2-bit KV cache quantization method for vision-language models, significantly reducing memory consumption and I/O bottlenecks while maintaining or improving task accuracy.

Contribution

The paper proposes an attention-aware quantization approach that adaptively allocates bit budgets according to token saliency and handles outliers in the KV tensors with the Walsh-Hadamard transform.
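To make the idea of saliency-based bit allocation concrete, here is a minimal sketch. It is not the paper's algorithm: the function name `allocate_bits`, the per-token saliency scores, the fraction of tokens treated as salient, and the choice of 4 bits for salient tokens are all hypothetical assumptions made for illustration. Only the general idea (more bits for salient tokens, 2 bits elsewhere) comes from the summary above.

```python
def allocate_bits(saliency, salient_fraction=0.25, high_bits=4, low_bits=2):
    """Assign a bit width to each token from its saliency score.

    Hypothetical policy: the top `salient_fraction` of tokens by score
    get `high_bits`; every other token gets `low_bits`.
    """
    n = len(saliency)
    if n == 0:
        return []
    k = max(1, round(n * salient_fraction))
    # Token indices ordered from most to least salient.
    ranked = sorted(range(n), key=lambda i: saliency[i], reverse=True)
    salient = set(ranked[:k])
    return [high_bits if i in salient else low_bits for i in range(n)]


# Toy attention-derived saliency scores for 8 tokens (made-up values).
scores = [0.40, 0.02, 0.03, 0.01, 0.30, 0.04, 0.15, 0.05]
bits = allocate_bits(scores)
avg_bits = sum(bits) / len(bits)
```

With these toy scores, the two highest-scoring tokens keep the higher precision and the average bit width stays close to 2, which is the kind of trade-off an adaptive budget aims for.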

Findings

01 Reduces peak memory by 2.13x

02 Supports up to 3.25x larger batch sizes

03 Achieves comparable or better accuracy on multimodal tasks

Abstract

Vision-language models (VLMs) show remarkable performance in multimodal tasks. However, excessively long multimodal inputs lead to oversized Key-Value (KV) caches, resulting in significant memory consumption and I/O bottlenecks. Previous KV quantization methods for Large Language Models (LLMs) may alleviate these issues but overlook the attention saliency differences of multimodal tokens, resulting in suboptimal performance. In this paper, we investigate the attention-aware token saliency patterns in VLMs and propose AKVQ-VL. AKVQ-VL leverages the proposed Text-Salient Attention (TSA) and Pivot-Token-Salient Attention (PSA) patterns to adaptively allocate bit budgets. Moreover, achieving extremely low-bit quantization requires effectively addressing outliers in KV tensors. AKVQ-VL utilizes the Walsh-Hadamard transform (WHT) to construct outlier-free KV caches, thereby reducing…
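The role of the WHT in low-bit quantization can be illustrated with a small, generic sketch. It assumes a single toy vector with one large outlier and a plain asymmetric uniform 2-bit quantizer. How AKVQ-VL actually applies the transform (which tensor dimension, which grouping, which quantizer) is not stated in this excerpt, so the helper names `fwht`, `quantize_2bit`, and `mse` and all values below are illustrative assumptions.

```python
import math


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def quantize_2bit(x):
    """Asymmetric uniform quantization to 4 levels; returns dequantized values."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return list(x)
    step = (hi - lo) / 3  # 2 bits -> 4 levels -> 3 intervals
    return [lo + round((v - lo) / step) * step for v in x]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


# Toy vector: small values plus one large outlier (made-up data).
x = [0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.1, 8.0]

# Quantize directly: the outlier stretches the range and flattens the rest.
err_direct = mse(x, quantize_2bit(x))

# Rotate, quantize, rotate back. The orthonormal WHT is its own inverse.
recovered = fwht(quantize_2bit(fwht(x)))
err_wht = mse(x, recovered)
```

In this toy case the rotation spreads the outlier's energy across all coordinates, so the four quantization levels cover the rotated values more evenly and `err_wht` comes out lower than `err_direct`. Real KV tensors will behave differently in detail.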

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Brain Tumor Detection and Classification

Methods: Softmax · Attention Is All You Need