Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT
Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Xiangyang Zhu, Si Liu, Xiangyu Yue, Dingning Liu, Wanli Ouyang, Ziwei Liu, Yu Qiao, Hongsheng Li, Peng Gao

TL;DR
Lumina-Next enhances Lumina-T2X by improving training stability, inference speed, and resolution extrapolation, enabling high-quality, multilingual, and multi-modal generation across diverse tasks with a unified flow-based diffusion transformer framework.
Contribution
The paper introduces Lumina-Next, a significantly improved version of Lumina-T2X, with novel architecture modifications, extrapolation methods, and efficiency techniques for versatile, high-performance generative modeling.
Findings
Enhanced generation quality and speed in text-to-image tasks.
Superior resolution extrapolation and multilingual capabilities.
Strong performance across diverse modalities and tasks.
Abstract
Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D…
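As a rough illustration of the "sandwich normalization" idea named in the abstract, the sketch below normalizes both the input and the output of a sublayer before the residual addition. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `rms_norm`, `sandwich_block`, `pre_norm_block`, and `amplify` are hypothetical, the choice of RMS normalization without a learned gain is an assumption made for brevity, and `amplify` is a stand-in for an attention or MLP sublayer whose output scale has grown large.

```python
import math

def rms(x):
    """Root-mean-square of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_norm(x, eps=1e-6):
    """Scale a vector to (approximately) unit RMS; no learned gain (assumption)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def pre_norm_block(x, sublayer):
    """Residual block that normalizes only the sublayer input."""
    h = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, h)]

def sandwich_block(x, sublayer):
    """Residual block that normalizes the sublayer input and its output."""
    h = rms_norm(x)      # pre-norm: fixes the scale fed into the sublayer
    h = sublayer(h)      # hypothetical stand-in for attention or an MLP
    h = rms_norm(h)      # post-norm: fixes the scale added to the residual
    return [a + b for a, b in zip(x, h)]

def amplify(h):
    """Hypothetical sublayer whose output is 100x larger than its input."""
    return [100.0 * v for v in h]

x = [1.0, -2.0, 3.0, 0.5]
out_pre = pre_norm_block(x, amplify)
out_sandwich = sandwich_block(x, amplify)

# Size of the update each block adds to the residual stream.
update_pre = rms([a - b for a, b in zip(out_pre, x)])
update_sandwich = rms([a - b for a, b in zip(out_sandwich, x)])
print(update_pre, update_sandwich)
```

With only a pre-norm, the update added to the residual stream inherits whatever scale the sublayer produces; with the second normalization, the update stays near unit RMS regardless. That bounded update is the intuition usually given for why normalizing on both sides of a sublayer can help training stability.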
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Model Reduction and Neural Networks
Methods: Diffusion
