TL;DR
StreamDiT is a streaming text-to-video generation model that achieves real-time performance at 16 FPS on a single GPU, enabling interactive video applications.
Contribution
The paper introduces a novel streaming video generation approach with flow matching, mixed training, and multistep distillation for real-time text-to-video synthesis.
Findings
Real-time generation at 16 FPS on one GPU
High-quality videos at 512p resolution
Multistep distillation reduces the number of function evaluations (NFEs), giving faster sampling
Abstract
Recently, great progress has been achieved in text-to-video (T2V) generation by scaling transformer-based diffusion models to billions of parameters, which can generate high-quality videos. However, existing models typically produce only short clips offline, restricting their use cases in interactive and real-time applications. This paper addresses these challenges by proposing StreamDiT, a streaming video generation model. StreamDiT training is based on flow matching by adding a moving buffer. We design mixed training with different partitioning schemes of buffered frames to boost both content consistency and visual quality. StreamDiT modeling is based on adaLN DiT with varying time embedding and window attention. To practice the proposed method, we train a StreamDiT model with 4B parameters. In addition, we propose a multistep distillation method tailored for StreamDiT. Sampling…
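The moving buffer mentioned in the abstract can be pictured with a small toy, given here only as a rough sketch. It assumes one common streaming layout that the abstract does not spell out: each buffer slot holds one frame at a different noise level, every step denoises all slots by one Euler step, the cleanest frame is emitted, and a fresh noise frame is appended. The names `oracle_velocity` and `stream_frames` are hypothetical, the "model" is an oracle that already knows the target (frame n is the constant value n), and the buffer is initialized with that oracle. The paper's partitioning schemes, window attention, and distillation are not represented.

```python
import random


def oracle_velocity(x, t, target):
    # Toy stand-in for a learned velocity model (assumption, not the paper's network).
    # With x_t = (1 - t) * data + t * noise, the velocity is noise - data = (x_t - data) / t.
    return [(xi - target) / t for xi in x]


def stream_frames(num_frames, buffer_size=4, pixels=3, seed=0):
    rng = random.Random(seed)
    dt = 1.0 / buffer_size

    def noise():
        return [rng.gauss(0.0, 1.0) for _ in range(pixels)]

    # Slot i always holds a frame at noise level (i + 1) / buffer_size,
    # so slot 0 is one step from clean and the last slot is pure noise.
    levels = [(i + 1) * dt for i in range(buffer_size)]
    ids = list(range(buffer_size))
    # Toy initialization: place each starting frame at its slot's noise level.
    buffer = [[(1 - t) * float(n) + t * e for e in noise()]
              for n, t in zip(ids, levels)]

    emitted = []
    while len(emitted) < num_frames:
        # One Euler step toward t = 0 for every frame in the buffer.
        stepped = []
        for x, t, n in zip(buffer, levels, ids):
            v = oracle_velocity(x, t, float(n))
            stepped.append([xi - dt * vi for xi, vi in zip(x, v)])
        # The oldest frame has reached t = 0: emit it and slide the buffer.
        emitted.append(stepped[0])
        buffer = stepped[1:] + [noise()]
        ids = ids[1:] + [ids[-1] + 1]
    return emitted


frames = stream_frames(6)
print(len(frames))
```

In this layout each buffer step emits exactly one finished frame, which is why a streaming sampler of this kind can produce output continuously rather than one fixed-length clip at a time.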
