Make Your LVLM KV Cache More Lightweight
Xihao Chen, Yangyang Guo, Roger Zimmermann

TL;DR
LightKV is a method that reduces GPU memory use and computation in Large Vision-Language Models (LVLMs) by compressing vision tokens through cross-modality message passing guided by the text prompt.
Contribution
The paper introduces LightKV, a prompt-aware compression technique that significantly reduces KV cache size and computation in LVLMs while maintaining performance.
Findings
Halves the vision-token KV cache size
Reduces computation by up to 40%
Maintains performance across eight open-source LVLMs and eight public benchmarks
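To give a sense of scale for the first finding, the short calculation below estimates the KV cache footprint of the vision tokens before and after halving. It is only a rough sketch: the configuration (576 vision tokens, 32 layers, 32 KV heads, head dimension 128, 16-bit values) is an illustrative assumption and is not taken from the paper, and halving is treated as a uniform cut in cached vision tokens, which simplifies the progressive compression described in the abstract.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # The leading factor 2 accounts for one key and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Hypothetical configuration, chosen for illustration only (not from the paper).
full = kv_cache_bytes(num_tokens=576, num_layers=32, num_kv_heads=32, head_dim=128)
half = kv_cache_bytes(num_tokens=288, num_layers=32, num_kv_heads=32, head_dim=128)

print(f"full vision-token KV cache: {full / 2**20:.0f} MiB")  # 288 MiB
print(f"after halving:              {half / 2**20:.0f} MiB")  # 144 MiB
```

Under these assumptions the saving is per image and per sequence, so it grows linearly with batch size and with the number of images in the context.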
Abstract
Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency in Large Language Models (LLMs), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage. To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings. Guided by text prompts, LightKV employs cross-modality message passing to aggregate informative messages across vision tokens and progressively compress them during prefill. This prompt-aware guidance distinguishes our method from prior vision-only compression strategies. We evaluate LightKV on eight open-source LVLMs across eight public benchmark datasets, e.g., MME and SeedBench. Experimental results…
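The general idea of prompt-guided vision-token compression can be illustrated with a toy sketch. This is not the authors' algorithm: the function name `compress_vision_tokens`, the cosine-similarity scoring against a mean prompt embedding, and the single-step averaging merge are all assumptions made here for illustration, whereas the abstract describes progressive compression during prefill.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def compress_vision_tokens(vision, text, keep):
    """Hypothetical prompt-guided merge: keep `keep` vision tokens, fold the rest into them."""
    dim = len(vision[0])
    # Summarize the prompt as the mean of its token embeddings (an assumption).
    prompt = [sum(t[d] for t in text) / len(text) for d in range(dim)]
    # Score each vision token by its similarity to the prompt summary.
    scores = [cosine(v, prompt) for v in vision]
    order = sorted(range(len(vision)), key=lambda i: scores[i], reverse=True)
    kept, dropped = order[:keep], order[keep:]
    # Each kept token starts as its own embedding with weight 1.
    sums = {i: list(vision[i]) for i in kept}
    counts = {i: 1 for i in kept}
    # Each dropped token passes its embedding to the most similar kept token.
    for j in dropped:
        target = max(kept, key=lambda i: cosine(vision[j], vision[i]))
        sums[target] = [s + x for s, x in zip(sums[target], vision[j])]
        counts[target] += 1
    # Average the accumulated messages; return tokens in their original order.
    return [[s / counts[i] for s in sums[i]] for i in sorted(kept)]


# Toy data: four 2-D "vision tokens" and a one-token "prompt".
vision = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text = [[1.0, 0.0]]
merged = compress_vision_tokens(vision, text, keep=2)
print(len(vision), "->", len(merged), "vision tokens")
```

Only the merged tokens would then contribute keys and values to the cache, which is where the memory saving comes from. Changing `text` changes which tokens survive, which is the sense in which such a scheme is prompt-aware rather than vision-only.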
