StreamDiT: Real-Time Streaming Text-to-Video Generation

Akio Kodaira; Tingbo Hou; Ji Hou; Markos Georgopoulos; Felix Juefei-Xu; Masayoshi Tomizuka; Yue Zhao

arXiv:2507.03745·cs.CV·March 30, 2026

TL;DR

StreamDiT is a streaming text-to-video generation model that achieves real-time performance at 16 FPS on a single GPU, enabling interactive video applications.
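
To put the frame rate in context, the arithmetic below converts a target frame rate into an average per-frame time budget. This is only a back-of-the-envelope sketch: the helper name `frame_budget_ms` is made up for illustration, and a streaming model may emit frames in groups, so the figure is an amortized average rather than a per-frame deadline.

```python
def frame_budget_ms(fps: float) -> float:
    """Average wall-clock time available per generated frame, in milliseconds."""
    return 1000.0 / fps


# At the reported 16 FPS, each frame has an average budget of 62.5 ms.
print(frame_budget_ms(16))
```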

Contribution

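The paper introduces a streaming approach to real-time text-to-video generation that combines flow matching over a moving buffer of frames, mixed training across different partitioning schemes of the buffered frames, and a multistep distillation method tailored to the model.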

Findings

01 Real-time generation at 16 FPS on one GPU

02 High-quality videos at 512p resolution

03 Effective distillation reduces the number of function evaluations (NFEs) for faster sampling

Abstract

Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To practice the proposed method, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling…
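
The abstract describes flow-matching training with a moving buffer of frames. The sketch below illustrates that general idea in NumPy under several assumptions that are not taken from the paper: the time convention (t = 0 is pure noise, t = 1 is clean data), a linear interpolation path, one time value per frame, plain Euler sampling, and the hypothetical names `make_training_pair`, `stream_frames`, and `oracle_velocity`. It does not reproduce the paper's partitioning schemes, window attention, time embedding, or distillation, and a toy closed-form velocity stands in for the trained model.

```python
import numpy as np


def make_training_pair(clean_frames, frame_times, rng):
    """Build a flow-matching input and target with one time value per frame.

    clean_frames: array of shape (B, D), the frames in the buffer.
    frame_times:  array of shape (B,), values in [0, 1]
                  (assumed convention: 0 = noise, 1 = clean data).
    """
    noise = rng.standard_normal(clean_frames.shape)
    t = frame_times[:, None]
    noisy = (1.0 - t) * noise + t * clean_frames  # linear interpolation path
    target_velocity = clean_frames - noise        # d(noisy)/dt along that path
    return noisy, target_velocity


def stream_frames(velocity_fn, num_frames, buffer_size, frame_dim, rng):
    """Emit frames one at a time from a moving buffer, using Euler steps.

    Each iteration pushes a fresh noise frame, advances every buffered frame
    by one step of size 1 / buffer_size, and pops the oldest frame once it
    has completed buffer_size steps (i.e. reached t = 1).
    """
    dt = 1.0 / buffer_size
    frames, steps_done, outputs = [], [], []
    while len(outputs) < num_frames:
        frames.append(rng.standard_normal(frame_dim))  # new frame starts as noise
        steps_done.append(0)
        x = np.stack(frames)
        t = np.array(steps_done) * dt                  # a different time per frame
        x = x + dt * velocity_fn(x, t)                 # one joint Euler step
        frames = list(x)
        steps_done = [s + 1 for s in steps_done]
        if steps_done[0] == buffer_size:               # oldest frame is now clean
            outputs.append(frames.pop(0))
            steps_done.pop(0)
    return np.stack(outputs)


# Toy stand-in for a trained model: the exact velocity toward one fixed target.
TARGET = np.full(8, 3.0)


def oracle_velocity(x, t):
    return (TARGET - x) / (1.0 - t)[:, None]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = stream_frames(oracle_velocity, num_frames=5, buffer_size=4,
                          frame_dim=8, rng=rng)
    print(video.shape)
```

In this sketch the first frame appears only after `buffer_size` iterations; after that warm-up, each iteration emits one frame while the buffer holds frames at staggered noise levels. How the real model schedules and groups buffered frames is left to the paper.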

Code & Models

Repositories

https://cumulo-autumn.github.io/StreamDiT
github