Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory

Vatsal Agarwal; Saksham Suri; Matthew Gwilliam; Pulkit Kumar; Abhinav Shrivastava

arXiv:2602.18434·cs.CV·February 23, 2026

Open Access

TL;DR

MemStream improves streaming video understanding by scaling the number of tokens kept per frame and adding adaptive selection and external retrieval strategies, with reported gains on CG-Bench, LVBench, and VideoMME (Long).

Contribution

This work introduces MemStream, which scales the per-frame token budget and combines adaptive token selection with external models to improve understanding of dense video streams.

Findings

01. +8.0% on CG-Bench

02. +8.5% on LVBench

03. +2.4% on VideoMME (Long)

Abstract

Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval…
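
As a rough illustration of the mechanisms the abstract names, the following is a minimal sketch, not the authors' implementation. It assumes each cached frame is summarized by a single feature vector, that retrieval ranks frames by cosine similarity to the query, and that redundancy is judged by comparing each token with the token at the same position in the previous frame. The function names (`cosine`, `retrieve_top_k`, `drop_redundant_tokens`) and the `threshold` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query, frame_feats, k):
    """Score every cached frame against the query and return the k best.

    Assumption: one feature vector per frame. Returns the selected frame
    indices in temporal order, plus the full list of similarity scores.
    """
    scores = [cosine(query, feat) for feat in frame_feats]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores


def drop_redundant_tokens(prev_tokens, cur_tokens, threshold=0.9):
    """Return positions of current-frame tokens that differ from the previous frame.

    Assumption: a token is redundant when its cosine similarity to the token
    at the same position in the previous frame is at or above `threshold`.
    """
    return [
        i
        for i, (prev, cur) in enumerate(zip(prev_tokens, cur_tokens))
        if cosine(prev, cur) < threshold
    ]
```

Under this sketch, if similarity scores drift upward over the stream as the abstract describes, `retrieve_top_k` would return mostly late frame indices regardless of the question, which is the retrieval bias the paper sets out to correct.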

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition