EvoStreaming: Your Offline Video Model Is a Natively Streaming Assistant
Zichen Wen, Boxue Yang, Junlong Ke, Jiajie Huang, Chenfei Liao, Junxi Wang, Xuyang Liu, Linfeng Zhang

TL;DR
EvoStreaming introduces a self-evolved framework that adapts offline video-language models for real-time streaming tasks with minimal data and no architectural changes, improving responsiveness and relevance.
Contribution
The paper presents EvoStreaming, a novel self-evolved adaptation method that enhances offline VideoLLMs for streaming without extensive data or model modifications.
Findings
EvoStreaming improves streaming evaluation scores by up to 10.8 points.
It requires only 1,000 self-generated samples, 139 times fewer than previous methods.
It maintains strong offline video understanding while enhancing streaming interaction.
Abstract
Streaming video understanding demands more than watching longer videos: assistants must decide when to speak in real time, balancing responsiveness against verbosity. Yet most video-language models (VideoLLMs) are trained for offline inference, and existing streaming benchmarks externalize this timing decision to the evaluator. We address this gap with RealStreamEval, a frame-level multi-turn evaluation protocol that exposes models to sequential observations and penalizes unnecessary responses. Under this protocol, we observe that strong offline VideoLLMs retain useful visual understanding but lack an interaction policy for deciding when to respond. Motivated by this observation, we propose EvoStreaming, a self-evolved streaming adaptation framework in which the base model itself acts as data generator, relevance annotator, and roll-out policy to synthesize streaming trajectories…
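To make the idea of a frame-level protocol that "penalizes unnecessary responses" concrete, here is a minimal sketch in Python. It is not RealStreamEval's actual scoring rule, which the abstract does not specify. The function `score_stream`, the `evidence_frame` parameter, the exact-match check, and the flat `penalty` per extra reply are all assumptions made for illustration: the policy sees one frame at a time, may stay silent by returning `None`, earns credit for one correct reply once the evidence is visible, and loses a fixed amount for every other reply.

```python
from typing import Callable, Optional, Sequence

# Hypothetical interface: a policy sees (frame index, frame description)
# and returns a reply string, or None to stay silent.
Policy = Callable[[int, str], Optional[str]]


def score_stream(
    frames: Sequence[str],
    policy: Policy,
    evidence_frame: int,
    answer: str,
    penalty: float = 0.25,  # assumed flat cost per unnecessary reply
) -> float:
    """Score one streaming episode under an assumed penalty rule."""
    answered = False
    unnecessary = 0
    for t, frame in enumerate(frames):
        reply = policy(t, frame)  # the model decides at every frame
        if reply is None:
            continue  # silence costs nothing
        if t >= evidence_frame and not answered and reply == answer:
            answered = True  # first correct, well-timed reply
        else:
            unnecessary += 1  # premature, repeated, or wrong reply
    base = 1.0 if answered else 0.0
    return max(0.0, base - penalty * unnecessary)


def chatty_policy(t: int, frame: str) -> Optional[str]:
    # Speaks at every frame, as an offline model prompted per frame might.
    return "cat" if "cat" in frame else "nothing yet"


def make_timed_policy() -> Policy:
    # Speaks exactly once, when the evidence first appears.
    spoken = False

    def policy(t: int, frame: str) -> Optional[str]:
        nonlocal spoken
        if "cat" in frame and not spoken:
            spoken = True
            return "cat"
        return None

    return policy


frames = ["empty room", "empty room", "cat enters", "cat sits"]
chatty_score = score_stream(frames, chatty_policy, evidence_frame=2, answer="cat")
timed_score = score_stream(frames, make_timed_policy(), evidence_frame=2, answer="cat")
silent_score = score_stream(frames, lambda t, f: None, evidence_frame=2, answer="cat")
print(chatty_score, timed_score, silent_score)
```

In this toy setting both the chatty and the timed policy recognize the cat, yet only the timed one keeps full credit. This mirrors the abstract's observation that offline models retain visual understanding but lack a policy for when to respond.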
