Make Your LVLM KV Cache More Lightweight

Xihao Chen; Yangyang Guo; Roger Zimmermann

arXiv:2605.00789·cs.CV·May 4, 2026

TL;DR

LightKV is a novel method that reduces GPU memory and computation in LVLMs by compressing vision tokens through cross-modality message passing guided by text prompts.

Contribution

The paper introduces LightKV, a prompt-aware compression technique that halves the vision-token KV cache size and reduces computation by up to 40% in LVLMs while maintaining performance.

Findings

01 Halves the vision-token KV cache size

02 Reduces computation by up to 40%

03 Maintains performance across multiple benchmarks
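To put the first finding in concrete terms, here is a back-of-the-envelope estimate of KV cache memory. The formula (keys plus values, per layer, per head, per token) is the standard one for decoder-only transformers. The model configuration (32 layers, 32 KV heads, head dimension 128, fp16) and the count of 576 vision tokens are illustrative assumptions, not figures from the paper, and the sketch assumes the halving applies uniformly at every layer.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens.

    The leading factor of 2 accounts for storing both K and V.
    bytes_per_elem=2 corresponds to fp16/bf16 storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Hypothetical 7B-class configuration (an assumption for illustration only).
cfg = dict(num_layers=32, num_kv_heads=32, head_dim=128)

full = kv_cache_bytes(576, **cfg)    # all vision tokens cached
halved = kv_cache_bytes(288, **cfg)  # half the vision tokens cached

print(f"full:   {full / 2**20:.0f} MiB")
print(f"halved: {halved / 2**20:.0f} MiB")
```

Under these assumptions the vision-token portion of the cache scales linearly with the number of cached tokens, so halving the tokens halves the memory for a single image. The saving grows with batch size and with the number of images per prompt.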

Abstract

Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency in Large Language Models (LLMs), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage. To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings. Guided by text prompts, LightKV employs cross-modality message passing to aggregate informative messages across vision tokens and progressively compress them during prefill. This prompt-aware guidance distinguishes our method from prior vision-only compression strategies. We evaluate LightKV on eight open-source LVLMs across eight public benchmark datasets, e.g., MME and SeedBench. Experimental results…
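The abstract does not spell out the message-passing procedure, so the following is only a generic sketch of prompt-guided token merging, not LightKV's actual algorithm. It assumes that relevance is the mean cosine similarity between a vision token and the prompt tokens, that the top-scoring vision tokens are kept, and that every other token is averaged into its most similar kept token. The function name `prompt_guided_merge` and the `keep_ratio` parameter are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def prompt_guided_merge(vision, text, keep_ratio=0.5):
    """Compress `vision` (n, d) using `text` (m, d) as guidance.

    Illustrative assumption, not the paper's method:
      1. score each vision token by mean cosine similarity to the prompt,
      2. keep the top `keep_ratio` fraction,
      3. average each remaining token into its most similar kept token.
    Returns (merged_tokens, counts), where counts[i] is how many original
    tokens were averaged into merged_tokens[i].
    """
    n = vision.shape[0]
    k = max(1, int(round(n * keep_ratio)))

    v_hat = l2_normalize(vision)
    t_hat = l2_normalize(text)

    # Prompt relevance of every vision token.
    relevance = (v_hat @ t_hat.T).mean(axis=1)
    order = np.argsort(-relevance)
    keep_idx = np.sort(order[:k])
    drop_idx = np.sort(order[k:])

    merged = vision[keep_idx].astype(float)  # astype returns a copy
    counts = np.ones(k)

    if drop_idx.size > 0:
        # Each dropped token passes its embedding to the closest kept token.
        sim = v_hat[drop_idx] @ v_hat[keep_idx].T
        target = sim.argmax(axis=1)
        np.add.at(merged, target, vision[drop_idx])
        np.add.at(counts, target, 1.0)

    return merged / counts[:, None], counts


rng = np.random.default_rng(0)
vision = rng.normal(size=(576, 64))  # hypothetical vision-token embeddings
text = rng.normal(size=(12, 64))     # hypothetical prompt-token embeddings
tokens, counts = prompt_guided_merge(vision, text, keep_ratio=0.5)
print(tokens.shape)
```

Averaging rather than discarding means the information in the removed tokens still contributes to the compressed sequence. A progressive variant, as the abstract describes, would presumably apply a step like this at several layers during prefill instead of once.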
