StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding

Yanlai Yang; Zhuokai Zhao; Satya Narayan Shukla; Aashu Singh; Shlok Kumar Mishra; Lizhu Zhang; Mengye Ren

arXiv:2508.15717 · cs.CV · August 22, 2025

TL;DR

StreamMem introduces a query-agnostic KV cache mechanism for streaming video understanding, enabling efficient long-video processing in large language models by compressing visual context without prior question access.

Contribution

It proposes a novel streaming, query-agnostic KV cache method that reduces memory overhead and improves long-video understanding in multimodal models.

Findings

1. Achieves state-of-the-art query-agnostic KV cache compression.
2. Performs competitively with query-aware approaches.
3. Effective in long video and streaming question answering tasks.

Abstract

Multimodal large language models (MLLMs) have made significant progress in visual-language reasoning, but their ability to efficiently handle long videos remains limited. Despite recent advances in long-context MLLMs, storing and attending to the key-value (KV) cache for long visual contexts incurs substantial memory and computational overhead. Existing visual compression methods require either encoding the entire visual context before compression or having access to the questions in advance, which is impractical for long video understanding and multi-turn conversational settings. In this work, we propose StreamMem, a query-agnostic KV cache memory mechanism for streaming video understanding. Specifically, StreamMem encodes new video frames in a streaming manner, compressing the KV cache using attention scores between visual tokens and generic query tokens, while maintaining a…
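The attention-based pruning step that the abstract describes can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`token_importance`, `compress_kv`), the averaging over proxy queries, and the fixed `budget` are assumptions made for illustration. The sketch scores each cached visual token by the mean softmax attention it receives from a small set of generic (question-independent) query vectors, then keeps the highest-scoring tokens in their original temporal order.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def token_importance(proxy_queries, keys):
    """Mean attention weight each cached key receives from the proxy queries.

    proxy_queries: list of d-dimensional vectors (hypothetical stand-ins for
    the generic query tokens mentioned in the abstract).
    keys: list of d-dimensional key vectors for the cached visual tokens.
    """
    d = len(keys[0])
    importance = [0.0] * len(keys)
    for q in proxy_queries:
        # Scaled dot-product logits between one query and every key.
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        for i, w in enumerate(softmax(logits)):
            importance[i] += w / len(proxy_queries)
    return importance


def compress_kv(keys, values, proxy_queries, budget):
    """Keep the `budget` highest-scoring tokens, preserving temporal order."""
    scores = token_importance(proxy_queries, keys)
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:budget]
    keep = sorted(top)  # restore original token order
    return [keys[i] for i in keep], [values[i] for i in keep], keep


# Toy usage: four cached tokens, one proxy query, budget of two.
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.0]]
values = ["v0", "v1", "v2", "v3"]
proxy_queries = [[1.0, 0.0]]
kept_keys, kept_values, kept_idx = compress_kv(keys, values, proxy_queries, budget=2)
```

Because the scores depend only on the proxy queries, the same compressed cache can serve any later question, which is the sense in which the selection is query-agnostic. In a streaming setting this selection would be rerun as new frames arrive; that loop is omitted here.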

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. This paper is well-written.
2. Using chat template tokens to extract universal visual information is a simple but interesting idea.
3. StreamMem shows great performance on long video understanding tasks.

Weaknesses

1. The paper lacks comparisons against essential KV cache compression baselines, such as StreamingLLM [1] and H2O [2], hindering a clear assessment of its relative performance.
2. An ablation study for the INPUT FRAME FILTERING module is necessary to validate its specific contribution and effectiveness.
3. The manuscript requires a deeper analysis of why the chat template tokens have the observed effect, as the current explanation is insufficient.
4. The contribution appears limited. Both INPUT FRA

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The paper is well-presented, and the core idea is illustrated very clearly.
2. The technical approach makes sense. The use of a KV cache based on the chat template is well-motivated. I found the ablation study in Table 4 particularly insightful. It demonstrates a reasonable trade-off between performance and generalization.

Weaknesses

1. The paper lacks evaluation on several important streaming benchmarks, which weakens its overall contribution. As a training-free method, comprehensive evaluation is crucial. Consider including results on OVO-Bench [1], OV-Bench [2], StreamBench [3], and StreamingBench [4].
2. KV pruning based on attention is not new to the community; several prior works have explored this area [5,6]. I feel the contribution is insufficient if the only change is making it both query-agnostic and streaming. [

Reviewer 03 · Rating 6 · Confidence 4

Strengths

This paper addresses a practically important problem in streaming video understanding with MLLMs and provides a systematic integration of existing techniques. The work demonstrates reasonable technical execution with comprehensive experiments across multiple benchmarks and three different MLLMs, supported by ablation studies that validate component contributions. The paper is clearly written with effective visualizations and covers relevant literature adequately. The training-free, plug-and-play

Weaknesses

**Incremental improvements over existing work**: The core technical components heavily overlap with prior methods. Cosine similarity-based frame filtering is similar to temporal compression in LongVU, cross-attention based KV pruning follows established patterns in LiveVLM and other KV compression works, and frame-wise merging has been explored in MA-LMM and related papers. The differences from InfiniPot-V appear incremental, mainly in the choice of proxy queries and specific implementation deta

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks