HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference
Bowen Zeng, Feiyang Ren, Jun Zhang, Xiaoling Gu, Ke Chen, Lidan Shou, Huan Li

TL;DR
HybridKV is a KV cache compression framework for multimodal large language models that reduces memory usage and decoding latency while maintaining or improving model performance.
Contribution
HybridKV classifies attention heads by their behavior and applies a compression method tailored to each class, outperforming existing approaches that only allocate budgets.
Findings
Reduces KV cache memory by up to 7.9 times.
Achieves 1.52 times faster decoding.
Maintains or improves model performance.
Abstract
Multimodal Large Language Models (MLLMs) have advanced unified reasoning over text, images, and videos, but their inference is hindered by the rapid growth of key-value (KV) caches. Each visual input expands into thousands of tokens, causing caches to scale linearly with context length and remain resident in GPU memory throughout decoding, which leads to prohibitive memory overhead and latency even on high-end GPUs. A common solution is to compress caches under a fixed allocated budget at different granularities: token-level uniformly discards less important tokens, layer-level varies retention across layers, and head-level redistributes budgets across heads. Yet these approaches stop at allocation and overlook the heterogeneous behaviors of attention heads that require distinct compression strategies. We propose HybridKV, a hybrid KV cache compression framework that integrates…
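To make the abstract's budget-allocation vocabulary concrete, here is a minimal Python sketch of three ideas it mentions: the linear growth of an uncompressed KV cache, token-level eviction under a fixed budget, and head-level redistribution of a shared budget. This is not HybridKV's algorithm. The function names, the importance scores, and the per-head weights are all hypothetical and chosen only for illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Uncompressed KV cache size: one key and one value vector
    per token, per head, per layer (hence the leading factor of 2)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def keep_top_tokens(scores, budget):
    """Token-level compression: keep the `budget` highest-scoring token
    positions and return them in their original order.
    `scores` is a hypothetical per-token importance signal."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


def allocate_head_budgets(head_weights, total_budget):
    """Head-level allocation: split one total token budget across heads in
    proportion to hypothetical per-head weights, using the largest-remainder
    method so the integer budgets sum exactly to `total_budget`."""
    total_weight = sum(head_weights)
    shares = [w / total_weight * total_budget for w in head_weights]
    budgets = [int(s) for s in shares]
    leftover = total_budget - sum(budgets)
    # Hand the remaining slots to the heads with the largest fractional parts.
    by_remainder = sorted(
        range(len(shares)), key=lambda i: shares[i] - budgets[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


if __name__ == "__main__":
    # Assumed configuration (not taken from the paper): 32 layers, 32 KV heads,
    # head dimension 128, 16-bit elements, 4096 cached tokens.
    print(kv_cache_bytes(32, 32, 128, 4096) / 2**30, "GiB")
    print(keep_top_tokens([0.1, 0.9, 0.3, 0.7], budget=2))
    print(allocate_head_budgets([1, 2, 3], total_budget=10))
```

Under the assumed configuration the uncompressed cache comes to 2 GiB for 4096 tokens, and doubling the context doubles it. Both allocation helpers only decide how many tokens to keep and where. The abstract's argument is that this allocation step alone ignores how differently individual heads behave.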
