MadaKV: Adaptive Modality-Perception KV Cache Eviction for Efficient Multimodal Long-Context Inference

Kunxi Li; Zhonghua Jiang; Zhouzhou Shen; Zhaode Wang; Chengfei Lv; Shengyu Zhang; Fan Wu; Fei Wu

arXiv:2506.15724·cs.LG·June 23, 2025

Open Access

TL;DR

MadaKV introduces a dynamic, modality-aware cache eviction strategy that significantly reduces memory and latency in multimodal large language models during long-context inference, while maintaining high accuracy.

Contribution

It proposes a novel, adaptive KV cache eviction method tailored for multimodal models, addressing modality importance disparities across attention heads.

Findings

01. Reduces KV cache memory footprint substantially.
02. Reduces inference decoding latency by a factor of 1.3 to 1.5.
03. Maintains high accuracy across various multimodal tasks.

Abstract

This paper introduces MadaKV, a modality-adaptive key-value (KV) cache eviction strategy designed to enhance the efficiency of multimodal large language models (MLLMs) in long-context inference. In multimodal scenarios, attention heads exhibit varying preferences for different modalities, resulting in significant disparities in modality importance across attention heads. Traditional KV cache eviction methods, which are tailored for unimodal settings, fail to capture modality-specific information, thereby yielding suboptimal performance. MadaKV addresses these challenges through two key components: modality preference adaptation and hierarchical compression compensation. By dynamically sensing modality information within attention heads and adaptively retaining critical tokens, MadaKV achieves substantial reductions in KV cache memory footprint and model inference decoding latency (1.3…
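The abstract names the idea of sensing modality preference per attention head but does not give a procedure. As a rough illustration only, the sketch below shows one way a per-head, modality-aware eviction step could look: a head's token budget is split across modalities in proportion to the attention mass each modality receives, and the highest-scoring tokens are kept within each modality. This is not the authors' algorithm. The function name `evict_for_head`, its inputs, and the proportional split are all assumptions made for this example, and hierarchical compression compensation is not modeled.

```python
from collections import defaultdict


def evict_for_head(scores, modalities, budget):
    """Return sorted indices of cached tokens to keep for one attention head.

    Hypothetical helper, not taken from the paper.
    scores     -- aggregated attention score per cached token
    modalities -- modality label per cached token, e.g. "text" or "image"
    budget     -- total number of cached tokens this head may keep
    """
    if len(scores) != len(modalities):
        raise ValueError("scores and modalities must have the same length")
    if budget >= len(scores):
        return list(range(len(scores)))

    # Group token indices by modality and sum the attention mass per modality.
    by_modality = defaultdict(list)
    for idx, label in enumerate(modalities):
        by_modality[label].append(idx)
    mass = {m: sum(scores[i] for i in idxs) for m, idxs in by_modality.items()}
    total = sum(mass.values())

    # Assumed rule: split the budget in proportion to attention mass
    # (rounded down), never exceeding the tokens a modality actually has.
    keep = []
    for m, idxs in by_modality.items():
        share = mass[m] / total if total > 0 else 1.0 / len(by_modality)
        quota = min(len(idxs), int(budget * share))
        ranked = sorted(idxs, key=lambda i: scores[i], reverse=True)
        keep.extend(ranked[:quota])

    # Slots left over from rounding go to the best remaining tokens overall.
    leftover = budget - len(keep)
    if leftover > 0:
        kept = set(keep)
        rest = sorted(
            (i for i in range(len(scores)) if i not in kept),
            key=lambda i: scores[i],
            reverse=True,
        )
        keep.extend(rest[:leftover])
    return sorted(keep)


# A head that attends mostly to image tokens keeps mostly image tokens.
image_heavy = evict_for_head(
    [0.01, 0.02, 0.01, 0.40, 0.30, 0.26],
    ["text", "text", "text", "image", "image", "image"],
    budget=2,
)
```

In this toy setup a head whose attention concentrates on image tokens spends its whole budget on them, while a more balanced head retains a mix. A modality-agnostic top-k over all tokens would not expose that per-head split explicitly.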

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Machine Learning in Healthcare