Decouple and Cache: KV Cache Construction for Streaming Video Understanding
Zhanzhong Pang, Dibyadip Chatterjee, Fadime Sener, Angela Yao

TL;DR
This paper introduces DSCache, a training-free mechanism for constructing key-value caches in streaming video understanding, enabling models to process unbounded streams efficiently and accurately.
Contribution
The paper presents DSCache, a novel cache construction method that decouples past and current caches and incorporates position-agnostic encoding, improving streaming video model performance.
Findings
Achieves a 2.5% accuracy improvement over prior methods on streaming video QA benchmarks.
Effectively maintains and updates caches for unbounded video streams.
Supports position extrapolation beyond training length, preventing position overflow.
Abstract
Streaming video understanding requires processing unbounded video streams with limited memory and computation, posing two key challenges. First, unbounded streams require continuously constructing new key-value (KV) caches and evicting old ones. Second, due to the high cost of collecting and training on unbounded streams, models must learn from short sequences while generalizing to long streams. Existing streaming VideoVLLMs fail to scale to unbounded video streams or focus on cache reuse strategies, leaving the impact of cache construction underexplored. In this paper, we propose Decoupled Streaming Cache (DSCache), a training-free cache construction mechanism that adapts pretrained offline models to streaming settings. DSCache maintains a cumulative past KV cache while constructing a separate instant cache on demand, decoupled from past caches to preserve the informativeness of…
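The idea of a bounded past cache kept apart from an instant cache can be illustrated with a toy sketch. This is not the paper's implementation: the class name `DecoupledCacheSketch`, the FIFO eviction policy, the scalar stand-ins for keys and values, and the choice to assign positions only when the context is read are all assumptions made for illustration, since the summary above does not specify those details.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple

# Hypothetical stand-in for one cached entry: a (key, value) pair stored
# WITHOUT a position, so positions can be assigned later at read time.
KV = Tuple[float, float]


@dataclass
class DecoupledCacheSketch:
    """Toy model of a bounded past cache plus a separate instant cache."""

    max_past: int  # assumed memory budget for the past cache, in entries
    past: Deque[KV] = field(default_factory=deque)

    def append_chunk(self, chunk: List[KV]) -> None:
        """Accumulate new entries; evict the oldest when over budget.

        FIFO eviction is an assumption, not the paper's stated policy.
        """
        for kv in chunk:
            self.past.append(kv)
            if len(self.past) > self.max_past:
                self.past.popleft()

    def build_context(self, instant: List[KV]) -> List[Tuple[int, KV]]:
        """Join the past cache with an on-demand instant cache.

        The instant entries are never written into `self.past`, and
        positions are assigned here as 0..n-1, so the largest position
        is bounded by max_past + len(instant) however long the stream is.
        """
        merged = list(self.past) + list(instant)
        return list(enumerate(merged))


# Simulate a long stream: 1000 chunks of 2 entries, with a budget of 8.
cache = DecoupledCacheSketch(max_past=8)
for t in range(1000):
    cache.append_chunk([(float(t), 0.0), (float(t), 1.0)])

# Build the instant cache only when a query arrives.
context = cache.build_context(instant=[(-1.0, 0.0), (-1.0, 1.0)])
max_position = max(pos for pos, _ in context)
```

In this sketch the largest position depends only on the cache budget and the instant cache size, not on how many frames have streamed past, which is one simple way to avoid running beyond a model's trained position range.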
