Streaming Video Instruction Tuning

Jiaer Xia; Peixian Chen; Mengdan Zhang; Xing Sun; Kaiyang Zhou

arXiv:2512.21334 · cs.CV · April 13, 2026

TL;DR

Streamo is a versatile real-time streaming video LLM that handles multiple tasks, including narration, action understanding, and question answering. It is trained on a large instruction-following dataset for broad generalization.

Contribution

Introduces Streamo, a real-time streaming video LLM, together with Streamo-Instruct-465K, a new large-scale instruction-following dataset that enables unified training across diverse streaming video tasks.

Findings

01 Streamo demonstrates strong temporal reasoning and responsive interaction.

02 It generalizes well across various streaming video benchmarks.

03 It bridges the gap between offline perception models and real-time assistants.

Abstract

We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of…

Code & Models

Repositories

maifoundations/Streamo (GitHub)
