HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

Bowen Zeng; Feiyang Ren; Jun Zhang; Xiaoling Gu; Ke Chen; Lidan Shou; Huan Li

arXiv:2604.05887 · cs.AI · April 8, 2026

TL;DR

HybridKV is a KV cache compression framework for multimodal large language models. It reduces KV cache memory by up to 7.9 times and speeds up decoding by 1.52 times while maintaining or improving model performance.

Contribution

The paper presents a hybrid compression strategy that classifies attention heads by their behavior and applies a compression method tailored to each class, and it reports outperforming existing approaches.

Findings

01

Reduces KV cache memory by up to 7.9 times.

02

Achieves 1.52 times faster decoding.

03

Maintains or improves model performance.

Abstract

Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key-value (KV) caches. Each visual input expands into thousands of tokens, causing caches to scale linearly with context length and remain resident in GPU memory throughout decoding, which leads to prohibitive memory overhead and latency even on high-end GPUs. A common solution is to compress caches under a fixed allocated budget at different granularities: token-level uniformly discards less important tokens, layer-level varies retention across layers, and head-level redistributes budgets across heads. Yet these approaches stop at allocation and overlook the heterogeneous behaviors of attention heads that require distinct compression strategies. We propose HybridKV, a hybrid KV cache compression framework that integrates…
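
To make the linear cache growth and the fixed-budget idea concrete, here is a minimal sketch. It is not the HybridKV method: the model dimensions, the function names, and the per-head scores are all hypothetical, chosen only to illustrate how an uncompressed KV cache scales with token count and how a fixed token budget can be split either uniformly or unevenly across heads.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Size of an uncompressed KV cache.

    The leading factor of 2 counts keys and values; every layer, head,
    and cached token stores one vector of length head_dim for each.
    """
    return (2 * batch_size * num_layers * num_kv_heads
            * head_dim * seq_len * bytes_per_elem)


def uniform_budgets(total_budget, num_heads):
    """Uniform split: every head keeps the same number of tokens."""
    return [total_budget // num_heads] * num_heads


def redistributed_budgets(total_budget, head_scores):
    """Head-level split: divide the same total in proportion to a per-head score.

    head_scores is a hypothetical importance measure; how such scores
    would be obtained is outside the scope of this sketch.
    """
    total_score = sum(head_scores)
    budgets = [int(total_budget * s / total_score) for s in head_scores]
    # Tokens lost to rounding go to the highest-scoring head,
    # so the budgets still sum to total_budget.
    top_head = head_scores.index(max(head_scores))
    budgets[top_head] += total_budget - sum(budgets)
    return budgets


if __name__ == "__main__":
    # Assumed dimensions: 32 layers, 8 KV heads, head_dim 128, 16-bit values.
    for tokens in (1_000, 2_000, 4_000):
        size = kv_cache_bytes(32, 8, 128, tokens)
        print(f"{tokens} tokens -> {size / 2**20:.0f} MiB")

    print(uniform_budgets(1024, 4))
    print(redistributed_budgets(1024, [1, 1, 2, 4]))
```

With these assumed dimensions, 1,000 cached tokens occupy 131,072,000 bytes (125 MiB), and doubling the token count doubles the size. Both budget functions hand out the same total number of retained tokens; they differ only in how that total is divided among heads.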
