Don't Pause! Every prediction matters in a streaming video
Dibyadip Chatterjee, Zhanzhong Pang, Fadime Sener, Yale Song, Angela Yao

TL;DR
This paper introduces SPOT-Bench, a new streaming VideoQA benchmark with a novel metric, and proposes AsynKV, a training-free model that improves real-time streaming perception and response behavior.
Contribution
The paper presents SPOT-Bench for evaluating streaming video models and introduces AsynKV, a new approach that enhances streaming perception without additional training.
Findings
Offline models detect events reliably but tend to spam predictions.
Silence training reduces spamming but causes unresponsiveness.
Half of the video segments require no response; these segments are called dead-time.
Abstract
Streaming video models should respond the moment an event unfolds, not after the moment has passed. Yet existing online VideoQA benchmarks remain largely retrospective. They pause the video at fixed timestamps, pose questions about current or past events, and score models only at those moments. This protocol leaves streaming predictions untested. To close this gap, we introduce SPOT-Bench, featuring multi-turn proactive queries that evaluate general streaming perception and assistive capabilities required by an always-on, real-time assistant. SPOT-Bench comes with Timeliness-F1, a consolidated metric that measures streaming predictions by their temporal precision and balanced coverage across the entire video. Our benchmark reveals: (i) offline models detect events reliably but spam predictions unprompted; (ii) post-training for silence reduces spamming but induces unresponsiveness;…
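To make findings (i) and (ii) concrete, here is a minimal sketch of why an F1-style score over timestamped predictions penalizes both spamming and unresponsiveness. It is not the paper's Timeliness-F1, whose exact definition is not given in this summary. The function name `tolerance_f1`, the fixed `tolerance` window, the greedy matching rule, and the toy timestamps are all assumptions made for illustration.

```python
def tolerance_f1(predicted_times, event_times, tolerance=1.0):
    """Toy F1 over timestamps (hypothetical, not the paper's metric).

    A prediction counts as a hit if it falls within +/- `tolerance` seconds
    of an event that no earlier prediction has already been matched to.
    """
    unmatched_events = sorted(event_times)
    true_positives = 0
    for t in sorted(predicted_times):
        # Greedily take the earliest unmatched event close enough to this prediction.
        match = next((e for e in unmatched_events if abs(e - t) <= tolerance), None)
        if match is not None:
            unmatched_events.remove(match)
            true_positives += 1

    # Precision drops when the model fires outside event windows (spamming).
    precision = true_positives / len(predicted_times) if predicted_times else 0.0
    # Recall drops when the model stays silent during events (unresponsiveness).
    recall = true_positives / len(event_times) if event_times else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up timestamps in seconds, purely illustrative.
events = [5.0, 20.0, 40.0]
spammy = [float(t) for t in range(0, 50, 2)]  # fires every 2 s, including dead-time
silent = [5.2]                                # responds once, then stays quiet
timely = [5.3, 19.6, 40.4]                    # one prediction near each event

for name, preds in [("spammy", spammy), ("silent", silent), ("timely", timely)]:
    print(name, round(tolerance_f1(preds, events), 3))
```

In this toy setup the spamming model reaches every event but wastes most of its predictions, while the silent model is precise but misses two of three events, so both score below the timely model. Greedy matching keeps the sketch short; an optimal one-to-one assignment could give a different score in edge cases.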
