Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

Le Zhuo; Ruoyi Du; Han Xiao; Yangguang Li; Dongyang Liu; Rongjie Huang; Wenze Liu; Lirui Zhao; Fu-Yun Wang; Zhanyu Ma; Xu Luo; Zehan Wang; Kaipeng Zhang; Xiangyang Zhu; Si Liu; Xiangyu Yue; Dingning Liu; Wanli Ouyang; Ziwei Liu; Yu Qiao; Hongsheng Li; Peng Gao

arXiv:2406.18583 · cs.CV · June 28, 2024

TL;DR

Lumina-Next enhances Lumina-T2X by improving training stability, inference speed, and resolution extrapolation, enabling high-quality, multilingual, and multi-modal generation across diverse tasks with a unified flow-based diffusion transformer framework.

Contribution

The paper introduces Lumina-Next, a significantly improved version of Lumina-T2X, with novel architecture modifications, extrapolation methods, and efficiency techniques for versatile, high-performance generative modeling.

Findings

01 Enhanced generation quality and speed in text-to-image tasks.

02 Superior resolution extrapolation and multilingual capabilities.

03 Strong performance across diverse modalities and tasks.

Abstract

Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalizations. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D…
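The abstract names sandwich normalization as one of the Next-DiT changes aimed at training stability. The following is a minimal sketch of the general idea only, assuming it means normalizing both the input and the output of a residual sublayer. The names `rms_norm`, `sandwich_block`, and `amplify` are hypothetical, the norm has no learned gain, and none of this is taken from the authors' code.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale a vector so its root-mean-square is roughly 1 (no learned gain)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def sandwich_block(x, sublayer):
    """Residual block with a norm on both sides of the sublayer (assumed form)."""
    h = rms_norm(x)      # pre-norm: bound the scale of the sublayer input
    h = sublayer(h)      # hypothetical stand-in for attention or an MLP
    h = rms_norm(h)      # post-norm: bound the scale of the sublayer output
    return [a + b for a, b in zip(x, h)]  # residual connection

def amplify(h):
    """Toy sublayer that blows activations up by 1000x."""
    return [1000.0 * v for v in h]

x = [1.0, -2.0, 3.0, -4.0]
y = sandwich_block(x, amplify)

# The residual update stays at unit scale despite the 1000x sublayer.
update = [b - a for a, b in zip(x, y)]
update_rms = math.sqrt(sum(v * v for v in update) / len(update))
```

In this toy setup the second norm keeps the magnitude of the residual update bounded regardless of how large the sublayer output grows, which is the intuition usually given for why the pattern helps stability.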

Code & Models

Repositories

alpha-vllm/lumina-t2x (PyTorch, official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Model Reduction and Neural Networks

Methods: Diffusion