TL;DR
A simple sliding-window baseline using recent frames with off-the-shelf VLMs matches or exceeds complex streaming video models, challenging the need for intricate memory mechanisms.
Contribution
Demonstrates that a straightforward recent-frame approach can outperform complex memory-based streaming video models, urging a reevaluation of current benchmarking practices.
Findings
SimpleStream achieves 67.7% accuracy on OVO-Bench with only 4 frames.
The benefit of longer context depends on the backbone model and does not increase uniformly with model scale.
Adding historical context can improve recall but may weaken real-time perception.
Abstract
Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but…
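The recent-frame idea described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `RecentFrameWindow`, the helper `answer_query`, and the `vlm` callable (any function that takes a list of frames and a question) are hypothetical placeholders, and frame sampling and prompt formatting are left out.

```python
from collections import deque
from typing import Any, Callable, List


class RecentFrameWindow:
    """Keeps only the most recent N frames of a stream (hypothetical helper)."""

    def __init__(self, num_frames: int = 4) -> None:
        if num_frames < 1:
            raise ValueError("num_frames must be at least 1")
        # A deque with maxlen silently drops the oldest frame once it is full.
        self._frames: deque = deque(maxlen=num_frames)

    def push(self, frame: Any) -> None:
        """Add the newest frame, evicting the oldest one if the window is full."""
        self._frames.append(frame)

    def snapshot(self) -> List[Any]:
        """Return the current window, ordered oldest to newest."""
        return list(self._frames)


def answer_query(
    window: RecentFrameWindow,
    question: str,
    vlm: Callable[[List[Any], str], str],
) -> str:
    """Query a VLM with only the recent frames; `vlm` is an assumed stand-in."""
    return vlm(window.snapshot(), question)


if __name__ == "__main__":
    window = RecentFrameWindow(num_frames=4)
    for frame_id in range(10):  # integers stand in for decoded frames
        window.push(frame_id)

    # A dummy model that just reports how many frames it was given.
    def dummy_vlm(frames: List[Any], question: str) -> str:
        return f"{len(frames)} frames seen for: {question}"

    print(window.snapshot())
    print(answer_query(window, "What is happening now?", dummy_vlm))
```

No memory bank, retrieval step, or compression module is involved: anything older than the last N frames is discarded, which is the property the paper contrasts with memory-based streaming models.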
