Mixing Importance with Diversity: Joint Optimization for KV Cache Compression in Large Vision-Language Models
Xuyang Liu, Xiyan Gui, Yuchao Zhang, Linfeng Zhang

TL;DR
This paper introduces MixKV, a novel KV cache compression method for large vision-language models that balances importance and diversity, effectively reducing memory usage while preserving semantic coverage and improving performance.
Contribution
MixKV adaptively combines importance and diversity for KV cache compression, addressing modality-specific redundancy in multi-modal models, and demonstrates significant performance improvements.
Findings
MixKV improves baseline methods by 5.1% on multi-modal benchmarks.
Achieves 8.0% and 9.0% gains on GUI grounding tasks.
Maintains inference efficiency while enhancing compression.
Abstract
Recent large vision-language models (LVLMs) demonstrate remarkable capabilities in processing extended multi-modal sequences, yet the resulting key-value (KV) cache expansion creates a critical memory bottleneck that fundamentally limits deployment scalability. While existing KV cache compression methods focus on retaining high-importance KV pairs to minimize storage, they often overlook the modality-specific semantic redundancy patterns that emerge distinctively in multi-modal KV caches. In this work, we first analyze how, beyond simple importance, the KV cache in LVLMs exhibits varying levels of redundancy across attention heads. We show that relying solely on importance can only cover a subset of the full KV cache information distribution, leading to potential loss of semantic coverage. To address this, we propose MixKV, a novel method that mixes importance with diversity for…
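The idea of mixing importance with diversity on a per-head basis can be illustrated with a small sketch. This is not the authors' formulation: the redundancy proxy (mean cosine similarity of each key to the head's mean key), the diversity score (negated similarity to that mean), the min-max normalization, and the function names `mix_scores` and `select_topk` are all assumptions made for illustration. The only point it shows is that a head with highly similar keys can give more weight to diversity when choosing which KV pairs to keep.

```python
import math


def _normalize(v):
    # Scale a vector to unit length (left unchanged if it is all zeros).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _minmax(xs):
    # Rescale scores to [0, 1]; constant inputs map to all zeros.
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


def mix_scores(keys, importance):
    """Hypothetical mixing of importance and diversity for ONE attention head.

    keys:       list of key vectors (one per cached token)
    importance: list of importance scores from any existing method
    Returns (mixed_scores, redundancy).
    """
    unit = [_normalize(k) for k in keys]
    n, d = len(unit), len(unit[0])
    mean = [sum(u[j] for u in unit) / n for j in range(d)]
    # Similarity of each unit-length key to the head's mean key.
    sim = [sum(u[j] * mean[j] for j in range(d)) for u in unit]
    # Assumed redundancy proxy in [0, 1]: average similarity to the mean.
    redundancy = min(1.0, max(0.0, sum(sim) / n))
    # Assumed diversity score: keys far from the mean score higher.
    diversity = _minmax([-s for s in sim])
    imp = _minmax(importance)
    # More redundant heads lean more on diversity.
    mixed = [(1 - redundancy) * a + redundancy * b
             for a, b in zip(imp, diversity)]
    return mixed, redundancy


def select_topk(keys, importance, budget):
    # Keep the `budget` highest-scoring positions, returned in original order.
    mixed, _ = mix_scores(keys, importance)
    order = sorted(range(len(mixed)), key=lambda i: mixed[i], reverse=True)
    return sorted(order[:budget])


if __name__ == "__main__":
    keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    importance = [0.9, 0.8, 0.7, 0.1]
    print(select_topk(keys, importance, budget=2))
```

In this toy head, three keys are identical and one is distinct but has low importance. Ranking by importance alone would keep positions 0 and 1; the mixed score keeps positions 0 and 3, so the distinct key survives compression. Because the mixing only rescores entries, it could in principle sit in front of any selection rule that consumes a per-token score.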
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper provides a thorough analysis of semantic redundancy in LVLMs, highlighting a previously underexplored limitation of importance-only KV compression. 2. The adaptive mixing of importance and diversity per head is both intuitive and empirically validated. 3. MixKV can be integrated with existing compression methods without modifying the underlying compression operator, making it practical for real-world deployment.
1. The paper does not include a direct comparison with closely related methods such as [HeadKV](https://arxiv.org/abs/2410.19258), which would strengthen the empirical evaluation. 2. The analysis of Cross-modality Redundancy Differences does not clearly illustrate the "cross-modality" aspect. From my understanding, it is more about *two different modalities*. 3. While the paper discusses Head-wise Redundancy Differences, similar redundancy patterns may also exist in pure LLMs. The unique characte
1. The paper effectively identifies a key problem through insightful feature analysis and subsequently proposes a targeted solution. The motivation is well-grounded, the methodology is appropriate, and the results are solid and convincing. 2. Although the proposed KV cache compression scheme is designed for VLMs, its effectiveness is also validated on text-only tasks. This demonstrates the method's strong generalizability and versatility. 3. The experimental evaluation is thorough. The method
1. No significant flaws were identified. 2. However, the paper could be further strengthened by including an analysis of the persistent performance gap that remains when compared to the full KV cache. 3. Additionally, a broader discussion that horizontally situates the proposed method among other KV cache (or GPU memory) saving techniques would be valuable. While direct experimental comparisons are not strictly necessary, a qualitative discussion of the trade-offs relative to other approaches
The key advantage of MixKV is its comprehensive benchmarking across various tasks and models. It demonstrates consistent performance improvements across multiple multi-modal and text understanding benchmarks, including DocVQA, TextVQA, and LongBench, as well as GUI Grounding tasks.
1. I believe the paper spends unnecessary length discussing **modality-specific redundancy differences** and **head-wise redundancy differences**, as these concepts have already been well-established in previous works, such as **MMinference** and **VisionZip**. For instance, the **head-wise redundancy** can be directly observed in **MMinference's Fig. 1**, making it redundant to claim this in such detail. Additionally, I find it questionable to use **text on Qwen2** and **image on Qwen2-VL** to
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
