MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference

Zhongwei Wan; Hui Shen; Xin Wang; Che Liu; Zheda Mai; Mi Zhang

arXiv:2502.17599·cs.CL·March 14, 2025

TL;DR

MEDA introduces a dynamic, layer-wise KV cache allocation method based on attention entropy to improve inference efficiency in multimodal large language models, significantly reducing memory use and increasing decoding speed.

Contribution

The paper presents MEDA, a novel approach that dynamically allocates KV cache per layer using attention entropy, addressing limitations of uniform cache reduction strategies.

Findings

1. Achieves up to 72% KV cache memory reduction.

2. Provides 2.82 times faster decoding speed.

3. Maintains or improves performance on multimodal long-context tasks.

Abstract

Long-context Multimodal Large Language Models (MLLMs) that incorporate long text-image and text-video modalities demand substantial resources, because their multimodal Key-Value (KV) caches grow with input length and reduce inference efficiency. Existing KV cache compression methods, for both text-only and multimodal LLMs, have neglected variations in attention density across layers and therefore often adopt uniform or progressive reduction strategies for layer-wise cache allocation. In this work, we propose MEDA, a dynamic layer-wise KV cache allocation method for efficient multimodal long-context inference. At its core, MEDA uses cross-modal attention entropy to determine the KV cache size at each MLLM layer. Given the dynamically allocated KV cache size at each layer, MEDA also employs a KV pair selection scheme to identify which KV pairs to select and a KV pair merging…
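The idea of sizing each layer's cache from attention entropy can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `attention_entropy` and `allocate_budgets`, the use of plain Shannon entropy averaged over query rows, and the proportional split with largest-remainder rounding are all assumptions made for illustration. The abstract does not state the exact allocation rule, nor which query and key positions enter the cross-modal entropy.

```python
import math


def attention_entropy(attn_rows):
    """Mean Shannon entropy (in nats) over the rows of an attention matrix.

    Each row is assumed to be a probability distribution over key positions.
    Hypothetical helper; the paper's cross-modal variant is not reproduced here.
    """
    total = 0.0
    for row in attn_rows:
        total += -sum(p * math.log(p) for p in row if p > 0.0)
    return total / len(attn_rows)


def allocate_budgets(layer_entropies, total_budget):
    """Split `total_budget` KV slots across layers in proportion to entropy.

    Assumption: layers with more dispersed (higher-entropy) attention get a
    larger share. Largest-remainder rounding keeps the integer sum exact.
    """
    n = len(layer_entropies)
    entropy_sum = sum(layer_entropies)
    if entropy_sum <= 0.0:
        # Degenerate case: fall back to a uniform split.
        shares = [total_budget / n] * n
    else:
        shares = [total_budget * h / entropy_sum for h in layer_entropies]

    budgets = [int(s) for s in shares]  # floor of each non-negative share
    leftover = total_budget - sum(budgets)
    # Hand the remaining slots to the layers with the largest fractional parts.
    order = sorted(range(n), key=lambda i: shares[i] - budgets[i], reverse=True)
    for i in order[:max(leftover, 0)]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    peaked = [[1.0, 0.0, 0.0, 0.0]]        # attention focused on one token
    diffuse = [[0.25, 0.25, 0.25, 0.25]]   # attention spread evenly
    entropies = [attention_entropy(peaked) + 1.0, attention_entropy(diffuse) + 1.0]
    print(allocate_budgets(entropies, total_budget=100))
```

In the usage example, a constant of 1.0 is added to each entropy only so that the fully peaked layer still receives a nonzero share. A real system would more likely enforce a minimum per-layer budget explicitly.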

Code & Models

Repositories

aiot-mlsys-lab/meda
PyTorch · Official

Videos

MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference · Underline

Taxonomy

Topics: Advanced Data Compression Techniques · Speech Recognition and Synthesis

Methods: Softmax · Attention Is All You Need