Flash-VStream: Memory-Based Real-Time Understanding for Long Video Streams
Haoji Zhang, Yiqin Wang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin

TL;DR
Flash-VStream is a real-time, memory-efficient video-language model designed for online video streams. It can understand long videos and answer user queries with reduced latency and VRAM usage, and it outperforms existing methods.
Contribution
The paper introduces Flash-VStream, a novel memory-based model for online video understanding, and proposes VStream-QA, a new benchmark for streaming video question answering.
Findings
Achieves state-of-the-art performance on offline video understanding benchmarks.
Reduces inference latency and VRAM consumption significantly.
Demonstrates superior performance on the proposed online streaming benchmark.
Abstract
Benefiting from advancements in large language models and cross-modal alignment, existing multi-modal video understanding methods have achieved prominent performance in offline scenarios. However, online video streams, one of the most common media forms in the real world, have seldom received attention. Compared to offline videos, the 'dynamic' nature of online video streams poses challenges for the direct application of existing models and introduces new problems, such as the storage of extremely long-term information and the interaction between continuous visual content and 'asynchronous' user questions. Therefore, in this paper we present Flash-VStream, a video-language model that simulates the human memory mechanism. Our model is able to process extremely long video streams in real time and respond to user queries simultaneously. Compared to existing models, Flash-VStream achieves…
Taxonomy
Topics: Image and Video Quality Assessment · Video Coding and Compression Technologies · Advanced Data Compression Techniques
