MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering

Junbin Xiao; Jiajun Chen; Tianxiang Sun; Xun Yang; Angela Yao

arXiv:2605.22269 · cs.CV · May 22, 2026

TL;DR

MuKV introduces a multi-grained KV cache compression and semi-hierarchical retrieval method to enhance efficiency and accuracy in long streaming VideoQA tasks, effectively managing visual token growth.

Contribution

The paper proposes a novel multi-grained KV cache compression and retrieval approach that preserves spatial and temporal details while improving streaming VideoQA performance.

Findings

01. Significant improvement in answer accuracy on long-streaming VideoQA benchmarks.

02. Effective memory reduction through dual signal token compression.

03. Enhanced online QA efficiency without sacrificing accuracy.

Abstract

Long streaming video QA remains challenging due to the growing number of visual tokens and the limited reasoning length of large language models (LLMs). KV caching stores the Key-Value (KV) states of historical tokens via LLM prefill and enables more efficient streaming QA. However, existing methods cache every one or two frames, which causes redundant memory usage and loses fine-grained spatial details within a frame or temporal context across frames. This paper proposes MuKV, a method that features a multi-grained KV cache compression module and a semi-hierarchical retrieval approach to improve both efficiency and accuracy for long streaming VideoQA. For the offline KV cache, MuKV extracts visual representations at the patch, frame, and segment levels. The multiple levels of granularity preserve both local cues and global temporal context, while maintaining efficiency with a dual signal token compression…
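To make the idea of patch-, frame-, and segment-level representations concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: mean pooling stands in for whatever compression MuKV actually applies (the dual signal token compression is not described in the visible abstract), and the coarse-to-fine lookup is only a guess at what a hierarchical retrieval step might look like. The names `build_multigrain`, `coarse_to_fine`, and `segment_len` are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used here as a toy relevance score."""
    return sum(x * y for x, y in zip(a, b))


def build_multigrain(frames: List[List[Vector]], segment_len: int) -> Dict[str, list]:
    """Build three granularities from frames[t][p] (patch p of frame t).

    Assumption: mean pooling is an illustrative stand-in, not the paper's method.
    """
    frame_level = [mean_pool(patches) for patches in frames]
    segment_level = [
        mean_pool(frame_level[i:i + segment_len])
        for i in range(0, len(frame_level), segment_len)
    ]
    return {"patch": frames, "frame": frame_level, "segment": segment_level}


def coarse_to_fine(cache: Dict[str, list], query: Vector, segment_len: int) -> Tuple[int, int]:
    """Pick the best-scoring segment, then the best frame inside that segment."""
    segments = cache["segment"]
    seg = max(range(len(segments)), key=lambda i: dot(segments[i], query))
    start = seg * segment_len
    end = min(start + segment_len, len(cache["frame"]))
    frame = max(range(start, end), key=lambda t: dot(cache["frame"][t], query))
    return seg, frame


# Toy input: 4 frames, 2 patches per frame, 2-dimensional features.
toy_frames = [
    [[1, 0], [1, 0]],
    [[1, 0], [0, 1]],
    [[0, 1], [0, 1]],
    [[0, 2], [0, 2]],
]
toy_cache = build_multigrain(toy_frames, segment_len=2)
print(toy_cache["segment"])                       # [[0.75, 0.25], [0.0, 1.5]]
print(coarse_to_fine(toy_cache, [0.0, 1.0], 2))   # (1, 3)
```

In this toy setup the coarse levels are far smaller than the patch level (2 segment vectors and 4 frame vectors against 8 patch vectors), which is the kind of trade-off between global context and local detail that the abstract points to.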
