CacheFlow: Compressive Streaming Memory for Efficient Long-Form Video Understanding
Shrenik Patel, Daivik Patel

TL;DR
CacheFlow introduces a training-free, memory-efficient method for long-form video question answering that combines dynamic token dropping with compressive long-term memory, enabling real-time streaming analysis with reduced computational load.
Contribution
It proposes a novel online token pruning and memory compression technique that allows vision-language models to perform long-form video understanding efficiently without fine-tuning.
Findings
Outperforms existing baselines on offline and streaming VQA benchmarks.
Processes up to 87% fewer tokens while maintaining accuracy.
Enables real-time, context-aware long-form video analysis.
Abstract
Long-form video question answering (VQA) overwhelms current vision-language models (VLMs) because attention and key-value (KV) caches grow with runtime, forcing either expensive inference or near-sighted sliding windows. We introduce CacheFlow, a training-free pipeline that pairs Dynamic Token Dropping (DTD) with a compressive long-term memory. DTD prunes per-patch tokens online via cosine similarity to the previous frame, and surviving tokens are packed into fixed-size blocks. This online, per-frame processing makes our approach fundamentally suited for live streaming VQA. As blocks are processed, each one's keys are summarized by a tiny recurrent encoder to form a retrieval index, while the block's full KV pairs are offloaded and later rehydrated for generation, preserving answer fidelity. At inference, a consensus-based retrieval mechanism retrieves only the Top-K most relevant…
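The Dynamic Token Dropping and block-packing steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not specify the similarity threshold, the block size, how patches are matched across frames, or what happens to a partially filled block, so the sketch assumes a fixed `threshold`, a fixed `block_size`, a same-position patch comparison, and a leftover buffer. All function names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_redundant_tokens(prev_frame, curr_frame, threshold=0.9):
    """Keep only patch tokens that differ enough from the previous frame.

    Assumption: each patch is compared with the patch at the same index
    in the previous frame, and tokens at or above `threshold` are dropped.
    """
    if prev_frame is None:
        # First frame: nothing to compare against, so keep every token.
        return list(curr_frame)
    return [
        tok
        for prev_tok, tok in zip(prev_frame, curr_frame)
        if cosine_similarity(prev_tok, tok) < threshold
    ]


def stream_blocks(frames, threshold=0.9, block_size=4):
    """Process frames one at a time and pack surviving tokens into fixed-size blocks.

    Returns (full_blocks, leftover_tokens). The leftover buffer is an
    assumption of this sketch, not something the abstract describes.
    """
    blocks, buffer, prev = [], [], None
    for frame in frames:
        buffer.extend(drop_redundant_tokens(prev, frame, threshold))
        prev = frame
        # Emit a block whenever enough tokens have accumulated.
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    return blocks, buffer


# Three tiny "frames" of two 2-D patch tokens each (toy data).
frames = [
    [[1.0, 0.0], [0.0, 1.0]],  # first frame: both tokens kept
    [[1.0, 0.0], [1.0, 0.0]],  # patch 0 unchanged (dropped), patch 1 changed (kept)
    [[1.0, 0.0], [1.0, 0.0]],  # nothing changed: both dropped
]
blocks, leftover = stream_blocks(frames, threshold=0.9, block_size=2)
```

In this toy run, three of the six tokens survive, giving one full block of two tokens and one leftover token. In the pipeline the abstract describes, each completed block would then be summarized for retrieval and its KV pairs offloaded; those stages are omitted here.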
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
