MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
Junbin Xiao, Jiajun Chen, Tianxiang Sun, Xun Yang, Angela Yao

TL;DR
MuKV introduces a multi-grained KV cache compression and semi-hierarchical retrieval method to enhance efficiency and accuracy in long streaming VideoQA tasks, effectively managing visual token growth.
Contribution
The paper proposes a novel multi-grained KV cache compression and retrieval approach that preserves spatial and temporal details while improving streaming VideoQA performance.
Findings
Significant improvement in answer accuracy on long-streaming VideoQA benchmarks.
Effective memory reduction through dual signal token compression.
Enhanced online QA efficiency without sacrificing accuracy.
Abstract
Long streaming video QA remains challenging due to growing visual tokens and the limited reasoning length of large language models (LLMs). KV-caching stores the Key-Value (KV) states of historical tokens via LLM prefill and enables more efficient streaming QA. However, existing methods cache every one or two frames, causing redundant memory usage and losing fine-grained spatial details within a frame or temporal context across frames. This paper proposes MuKV, a method that features a multi-grained KV cache compression module and a semi-hierarchical retrieval approach to improve both efficiency and accuracy for long streaming VideoQA. For the offline KV cache, MuKV extracts visual representations at the patch, frame, and segment levels. The multiple levels of granularity preserve both local cues and global temporal context, while maintaining efficiency with a dual signal token compression…
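To make the idea of multiple granularities concrete, here is a minimal sketch, not the authors' implementation. It assumes patch features arrive as an array of shape (frames, patches, dim), builds frame- and segment-level summaries by plain mean pooling, and then retrieves coarse-to-fine: segments are scored first, then frames inside the best segments. The function names, the mean pooling, the cosine scoring, and the two-stage retrieval are all illustrative assumptions. The abstract does not specify these details, and the sketch operates on raw features rather than cached KV states.

```python
import numpy as np

def multi_grained_features(patch_tokens, segment_len):
    """Hypothetical helper: pool (T, P, D) patch features into three granularities."""
    num_frames = patch_tokens.shape[0]
    # Frame level: average the patches of each frame -> (T, D).
    frame = patch_tokens.mean(axis=1)
    # Segment level: average consecutive groups of `segment_len` frames.
    num_segments = -(-num_frames // segment_len)  # ceiling division
    segment = np.stack([
        frame[i * segment_len:(i + 1) * segment_len].mean(axis=0)
        for i in range(num_segments)
    ])
    return {"patch": patch_tokens, "frame": frame, "segment": segment}

def cosine_scores(items, query):
    """Cosine similarity between each row of `items` (N, D) and `query` (D,)."""
    norms = np.linalg.norm(items, axis=1) * np.linalg.norm(query)
    return items @ query / (norms + 1e-8)

def coarse_to_fine_retrieve(query, feats, segment_len, top_segments=1, top_frames=1):
    """Assumed two-stage retrieval: pick segments first, then frames within them."""
    num_frames = feats["frame"].shape[0]
    seg_scores = cosine_scores(feats["segment"], query)
    best_segments = np.argsort(-seg_scores)[:top_segments]
    # Collect the frame indices covered by the selected segments.
    candidates = []
    for s in best_segments:
        start = int(s) * segment_len
        candidates.extend(range(start, min(start + segment_len, num_frames)))
    candidates = np.array(candidates)
    frame_scores = cosine_scores(feats["frame"][candidates], query)
    keep = np.argsort(-frame_scores)[:top_frames]
    return sorted(candidates[keep].tolist())
```

Scoring a handful of segment summaries before touching individual frames keeps the per-query cost roughly proportional to the number of segments plus the frames in the selected ones, rather than to the full video length.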
