TL;DR
Streamo is a real-time streaming video LLM that handles multiple tasks, including narration, action understanding, and question answering. It is trained on a large-scale instruction-following dataset to generalize broadly.
Contribution
The paper introduces Streamo, a real-time streaming video LLM, together with Streamo-Instruct-465K, a large-scale instruction-following dataset that enables unified training across diverse streaming video tasks.
Findings
Streamo demonstrates strong temporal reasoning and responsive interaction.
It generalizes well across various streaming video benchmarks.
It bridges the gap between offline perception models and real-time assistants.
Abstract
We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of…
