MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang

TL;DR
MEDA introduces a dynamic, layer-wise KV cache allocation method based on attention entropy to improve inference efficiency in multimodal large language models, significantly reducing memory use and increasing decoding speed.
Contribution
The paper presents MEDA, a novel approach that dynamically allocates KV cache per layer using attention entropy, addressing limitations of uniform cache reduction strategies.
Findings
Achieves up to 72% KV cache memory reduction.
Delivers 2.82 times faster decoding.
Maintains or improves performance on multimodal long-context tasks.
Abstract
Long-context Multimodal Large Language Models (MLLMs) that incorporate long text-image and text-video modalities demand substantial resources, because their multimodal Key-Value (KV) caches grow with input length and strain inference efficiency. Existing KV cache compression methods, for both text-only and multimodal LLMs, have neglected variations in attention density across layers and therefore often adopt uniform or progressive reduction strategies for layer-wise cache allocation. In this work, we propose MEDA, a dynamic layer-wise KV cache allocation method for efficient multimodal long-context inference. At its core, MEDA uses cross-modal attention entropy to determine the KV cache size at each MLLM layer. Given the dynamically allocated KV cache size at each layer, MEDA also employs a KV pair selection scheme to identify which KV pairs to select and a KV pair merging…
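To make the idea of entropy-driven, layer-wise allocation concrete, here is a minimal sketch. It is not the authors' implementation. The abstract only states that cross-modal attention entropy determines each layer's cache size, so the proportional split, the per-layer minimum, and the function names `attention_entropy` and `allocate_layer_budgets` are all assumptions made for illustration. The sketch also treats any list of attention distributions as input and does not model which rows count as cross-modal.

```python
import math


def attention_entropy(attn_rows):
    """Mean Shannon entropy (in nats) over a list of attention distributions.

    Each row is assumed to be a probability distribution over keys
    (non-negative, summing to 1), e.g. one softmax-normalized attention row.
    """
    total = 0.0
    for row in attn_rows:
        # Zero-probability entries contribute nothing to the entropy.
        total += -sum(p * math.log(p) for p in row if p > 0.0)
    return total / len(attn_rows)


def allocate_layer_budgets(layer_entropies, total_budget, min_per_layer=1):
    """Split total_budget KV slots across layers in proportion to entropy.

    ASSUMPTION: proportional allocation is a hypothetical rule chosen for
    illustration; the source does not specify the exact mapping.
    """
    n = len(layer_entropies)
    if total_budget < n * min_per_layer:
        raise ValueError("total_budget too small for the per-layer minimum")

    spare = total_budget - n * min_per_layer
    entropy_sum = sum(layer_entropies)
    if entropy_sum == 0:
        weights = [1.0 / n] * n  # fall back to a uniform split
    else:
        weights = [e / entropy_sum for e in layer_entropies]

    raw = [w * spare for w in weights]
    budgets = [min_per_layer + math.floor(r) for r in raw]

    # Hand out slots lost to rounding, largest fractional remainder first.
    leftover = max(total_budget - sum(budgets), 0)
    order = sorted(range(n), key=lambda i: raw[i] - math.floor(raw[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets
```

Under this assumed rule, a layer whose attention is spread over many tokens (high entropy) keeps more KV pairs than a layer whose attention concentrates on a few tokens (low entropy), while the total stays fixed. A uniform strategy would instead give every layer the same share regardless of its attention pattern.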
Taxonomy
Topics: Advanced Data Compression Techniques · Speech Recognition and Synthesis
Methods: Softmax · Attention Is All You Need
