VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration

Dezhan Tu; Danylo Vashchilenko; Yuzhe Lu; Panpan Xu

arXiv:2410.23317·cs.CV·November 1, 2024

TL;DR

VL-Cache is a sparsity- and modality-aware KV cache compression technique tailored for vision-language models. Per the findings reported here, retaining 10% of the KV cache maintains accuracy while yielding up to a 2.33x end-to-end latency speedup and a 90% smaller GPU memory footprint.

Contribution

The paper presents a novel cache compression method for VLMs that adapts to layer-specific sparsity and token importance, outperforming existing approaches.

Findings

01 Retaining 10% of KV cache maintains accuracy.

02 Up to 2.33x end-to-end latency speedup.

03 GPU memory footprint reduced by 90%.
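
The link between the 10% retention figure and the 90% memory figure can be checked with back-of-the-envelope arithmetic. The sketch below uses the standard KV cache size formula; the model dimensions (32 layers, 32 KV heads, head dimension 128, fp16, 2000 cached tokens) are illustrative assumptions, not values taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    """Size of a dense KV cache in bytes (factor 2 = one key + one value tensor)."""
    return 2 * batch * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)

full = kv_cache_bytes(seq_len=2000, **shape)   # every token cached
kept = kv_cache_bytes(seq_len=200, **shape)    # 10% of tokens cached
reduction = 1 - kept / full

print(f"full cache: {full / 1e9:.2f} GB, compressed: {kept / 1e9:.2f} GB")
print(f"reduction: {reduction:.0%}")
```

Because cache size is linear in the number of cached tokens, keeping 10% of the entries on average removes about 90% of the cache memory regardless of the specific model shape.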

Abstract

Vision-Language Models (VLMs) have demonstrated impressive performance across a versatile set of tasks. A key challenge in accelerating VLMs is storing and accessing the large Key-Value (KV) cache that encodes long visual contexts, such as images or videos. While existing KV cache compression methods are effective for Large Language Models (LLMs), directly migrating them to VLMs yields suboptimal accuracy and speedup. To bridge the gap, we propose VL-Cache, a novel KV cache compression recipe tailored for accelerating VLM inference. In this paper, we first investigate the unique sparsity pattern of VLM attention by distinguishing visual and text tokens in prefill and decoding phases. Based on these observations, we introduce a layer-adaptive sparsity-aware cache budget allocation method that effectively distributes the limited cache budget across different layers, further reducing KV…
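
The abstract is truncated before it gives the allocation rule, so the following is only a minimal sketch of what a layer-adaptive, sparsity-aware budget allocation could look like. It assumes (hypothetically) that each layer's share of the cache budget is proportional to its attention density, measured as the fraction of post-softmax weights above a relative threshold. The function names, the threshold value, and the proportional rule are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def layer_density(attn, rel_threshold=0.01):
    """Fraction of attention weights that are >= rel_threshold * their row maximum.

    attn: post-softmax attention of shape (heads, queries, keys).
    Simplification: causal masking is not treated specially here.
    """
    row_max = attn.max(axis=-1, keepdims=True)
    return float((attn >= rel_threshold * row_max).mean())

def allocate_budgets(densities, seq_len, avg_keep_ratio=0.1):
    """Split a total token budget across layers in proportion to density.

    Assumed rule: denser (less sparse) layers keep more KV entries, while the
    average kept fraction across layers stays near avg_keep_ratio.
    """
    d = np.asarray(densities, dtype=float)
    total = avg_keep_ratio * seq_len * len(d)   # total tokens kept over all layers
    raw = total * d / d.sum()                   # proportional share per layer
    # Rounding and clipping may shift the total slightly from the target.
    return np.clip(np.round(raw), 1, seq_len).astype(int)

# Toy example: one dense layer and two sparse layers, 1000 cached tokens.
budgets = allocate_budgets([0.4, 0.1, 0.1], seq_len=1000, avg_keep_ratio=0.1)
print(budgets)
```

In this toy setting the dense layer receives four times the budget of each sparse layer, while the three layers together still keep 10% of the tokens on average.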

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Advanced Neural Network Applications

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings