Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT
Dongyang Liu, Shicheng Li, Yutong Liu, Zhen Li, Kai Wang, Xinyue Li,, Qi Qin, Yufei Liu, Yi Xin, Zhongyu Li, Bin Fu, Chenyang Si, Yuewen Cao,, Conghui He, Ziwei Liu, Yu Qiao, Qibin Hou, Hongsheng Li, Peng Gao

TL;DR
Lumina-Video introduces a multi-scale Next-DiT framework for efficient, flexible, and controllable high-quality video generation, leveraging tailored architectures and training schemes to address spatiotemporal complexity.
Contribution
It presents Lumina-Video with a multi-scale Next-DiT architecture, motion control, progressive training, and a novel video-to-audio model, advancing video synthesis capabilities.
Findings
High aesthetic quality and motion smoothness achieved
Efficient training and inference at high resolutions and FPS
Effective control of generated video dynamics
Abstract
Recent advancements have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in the generation of photorealistic images with Next-DiT. However, its potential for video generation remains largely untapped, with significant challenges in modeling the spatiotemporal complexity inherent to video data. To address this, we introduce Lumina-Video, a framework that leverages the strengths of Next-DiT while introducing tailored solutions for video synthesis. Lumina-Video incorporates a Multi-scale Next-DiT architecture, which jointly learns multiple patchifications to enhance both efficiency and flexibility. By incorporating the motion score as an explicit condition, Lumina-Video also enables direct control of generated videos' dynamic degree. Combined with a progressive training scheme…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Vision and Imaging · Video Coding and Compression Technologies · Image and Video Quality Assessment
