Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT

Dongyang Liu; Shicheng Li; Yutong Liu; Zhen Li; Kai Wang; Xinyue Li; Qi Qin; Yufei Liu; Yi Xin; Zhongyu Li; Bin Fu; Chenyang Si; Yuewen Cao; Conghui He; Ziwei Liu; Yu Qiao; Qibin Hou; Hongsheng Li; Peng Gao

arXiv:2502.06782 · cs.CV · February 13, 2025

TL;DR

Lumina-Video introduces a multi-scale Next-DiT framework for efficient, flexible, and controllable high-quality video generation, leveraging tailored architectures and training schemes to address spatiotemporal complexity.

Contribution

It presents Lumina-Video with a multi-scale Next-DiT architecture, motion control, progressive training, and a novel video-to-audio model, advancing video synthesis capabilities.

Findings

1. High aesthetic quality and motion smoothness achieved
2. Efficient training and inference at high resolutions and FPS
3. Effective control of generated video dynamics

Abstract

Recent advancements have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in the generation of photorealistic images with Next-DiT. However, its potential for video generation remains largely untapped, with significant challenges in modeling the spatiotemporal complexity inherent to video data. To address this, we introduce Lumina-Video, a framework that leverages the strengths of Next-DiT while introducing tailored solutions for video synthesis. Lumina-Video incorporates a Multi-scale Next-DiT architecture, which jointly learns multiple patchifications to enhance both efficiency and flexibility. By incorporating the motion score as an explicit condition, Lumina-Video also enables direct control of generated videos' dynamic degree. Combined with a progressive training scheme…
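To make the idea of "multiple patchifications" concrete, the following is a minimal sketch, not the authors' implementation. It splits one video latent into non-overlapping spatiotemporal patches at several patch sizes and counts the resulting tokens; larger patches yield fewer tokens and therefore cheaper transformer passes. The function name `patchify`, the latent shape, and the three patch sizes are assumptions chosen for illustration.

```python
import numpy as np


def patchify(latent, patch):
    """Split a (T, H, W, C) latent into non-overlapping spatiotemporal patches.

    Returns an array of shape (num_tokens, pt * ph * pw * C).
    Hypothetical helper for illustration only.
    """
    T, H, W, C = latent.shape
    pt, ph, pw = patch
    if T % pt or H % ph or W % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    # Separate each axis into (number of patches, patch extent).
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Bring the patch-grid axes to the front and the within-patch axes to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, pt * ph * pw * C)


# Assumed latent: 8 frames, 16x16 spatial grid, 4 channels.
latent = np.zeros((8, 16, 16, 4), dtype=np.float32)

# Assumed (time, height, width) patch sizes, from fine to coarse.
PATCH_SIZES = [(1, 2, 2), (2, 2, 2), (2, 4, 4)]

token_counts = {p: patchify(latent, p).shape[0] for p in PATCH_SIZES}
for p, n in token_counts.items():
    print(f"patch {p}: {n} tokens")
```

Under these assumed sizes, the coarsest patchification produces one eighth as many tokens as the finest, which is the kind of efficiency and flexibility trade-off the abstract refers to.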

Taxonomy

Topics: Advanced Vision and Imaging · Video Coding and Compression Technologies · Image and Video Quality Assessment