CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding

Shrenik Patel; Daivik Patel

arXiv:2511.13644·cs.CV·November 18, 2025

TL;DR

CacheFlow introduces a training-free, memory-efficient method for long-form video question answering that combines dynamic token dropping with compressive long-term memory, enabling real-time streaming analysis with reduced computational load.

Contribution

The paper proposes an online token pruning and memory compression technique that lets vision-language models perform long-form video understanding efficiently, without any fine-tuning.

Findings

1. Outperforms existing baselines on offline and streaming VQA benchmarks.

2. Processes up to 87% fewer tokens while maintaining accuracy.

3. Enables real-time, context-aware long-form video analysis.

Abstract

Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce CacheFlow, a training-free pipeline that pairs Dynamic Token Dropping (DTD) with a compressive long-term memory. DTD prunes per-patch tokens online via cosine similarity to the previous frame, and surviving tokens are packed into fixed-size blocks. This online, per-frame processing makes our approach fundamentally suited for live streaming VQA. As blocks are processed, each one's keys are summarized by a tiny recurrent encoder to form a retrieval index, while the block's full KV pairs are offloaded and later rehydrated for generation, preserving answer fidelity. At inference, a consensus-based retrieval mechanism retrieves only the Top-K most relevant…
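The first two steps in the abstract, dropping near-duplicate patch tokens and packing the survivors into fixed-size blocks, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold of 0.9, the block size, and the choice to compare each patch with the patch at the same position in the previous frame are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_tokens(prev_frame, cur_frame, threshold=0.9):
    """Keep only patch tokens that changed relative to the previous frame.

    Assumption: a token is compared with the token at the same patch index
    in the previous frame, and dropped when similarity >= threshold
    (threshold is a hypothetical value, not taken from the paper).
    """
    if prev_frame is None:  # first frame: nothing to compare against
        return list(cur_frame)
    return [
        tok
        for prev_tok, tok in zip(prev_frame, cur_frame)
        if cosine(prev_tok, tok) < threshold
    ]


def pack_blocks(frames, block_size=4, threshold=0.9):
    """Stream frames, prune tokens online, and pack survivors into fixed-size blocks.

    Returns (full_blocks, pending), where pending holds tokens that have not
    yet filled a block. block_size is a hypothetical value.
    """
    blocks, current, prev = [], [], None
    for frame in frames:
        for tok in drop_tokens(prev, frame, threshold):
            current.append(tok)
            if len(current) == block_size:
                blocks.append(current)
                current = []
        prev = frame
    return blocks, current


# Toy stream: 3 frames, 3 patches each, 2-dimensional embeddings.
frames = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]],  # only the third patch changes
    [[0.0, 1.0], [0.0, 1.0], [-1.0, 1.0]],  # only the first patch changes
]
blocks, pending = pack_blocks(frames, block_size=4)
kept = sum(len(b) for b in blocks) + len(pending)
print(f"kept {kept} of {sum(len(f) for f in frames)} tokens")
```

In this toy stream the first frame is kept whole and each later frame contributes only the one patch that changed, so static content costs nothing after it is first seen. The later stages in the abstract (summarizing each block's keys, offloading KV pairs, and retrieval) are not covered by this sketch.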

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques