KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy

Yingbing Huang; Tharun Adithya Srikrishnan; Steven K. Reinhardt; Deming Chen

arXiv:2605.16439 · cs.CV · May 19, 2026

TL;DR

KVCapsule is a structure-aware KV cache compression method for vision-language models that cuts KV cache memory by 2.4x and raises throughput by up to 2x, with negligible accuracy degradation.

Contribution

The paper presents KVCapsule, a framework that compresses the KV cache of vision tokens in VLMs, enabling more efficient inference with minimal accuracy loss.

Findings

01. Throughput (TPS) increased by up to 2x

02. KV cache memory reduced by 2.4x

03. Negligible degradation in model accuracy

Abstract

Vision-Language Models (VLMs) have emerged as a critical and fast-growing extension of Large Language Models (LLMs) that enable multimodal reasoning through both text and image inputs. Although VLMs enrich the capabilities of language models, they also inherit and amplify key computational bottlenecks: the memory overhead caused by the large key-value (KV) cache during autoregressive decoding. This challenge is particularly severe in VLMs, where images produce longer token sequences and denser feature representations compared to text. Moreover, the spatial and information-rich nature of vision tokens introduces structured attention patterns that make many LLM-oriented KV cache compression techniques ineffective when applied directly to VLMs. In this work, we conduct a detailed empirical analysis of the behavior of vision tokens, highlighting the critical differences from purely…
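
To make the memory bottleneck described in the abstract concrete, the following is a minimal sketch of the standard KV cache size calculation for a decoder-only transformer. It is not taken from the paper: the function name `kv_cache_bytes`, the model dimensions, and the token counts are all hypothetical values chosen for illustration, and the sketch assumes 16-bit cache entries and no compression.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Return the bytes needed to cache keys and values for one sequence.

    Hypothetical helper for illustration; not part of KVCapsule.
    """
    # Factor of 2: each layer stores one key vector and one value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


# Hypothetical model configuration (assumed, not from the paper).
NUM_LAYERS = 32
NUM_KV_HEADS = 32
HEAD_DIM = 128

# Hypothetical prompt: a short text question plus one image's worth of vision tokens.
text_tokens = 64
vision_tokens = 2048

text_bytes = kv_cache_bytes(text_tokens, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
vision_bytes = kv_cache_bytes(vision_tokens, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
total_bytes = text_bytes + vision_bytes

MIB = 1024 * 1024
print(f"text cache:   {text_bytes / MIB:.1f} MiB")
print(f"vision cache: {vision_bytes / MIB:.1f} MiB")
print(f"vision share: {vision_bytes / total_bytes:.1%}")
```

Under these assumed numbers the vision tokens account for almost all of the cache, which is why a compression method that targets vision tokens specifically can shrink the overall footprint substantially.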
