Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

Peng Gao; Le Zhuo; Dongyang Liu; Ruoyi Du; Xu Luo; Longtian Qiu; Yuhang Zhang; Chen Lin; Rongjie Huang; Shijie Geng; Renrui Zhang; Junlin Xi; Wenqi Shao; Zhengkai Jiang; Tianshuo Yang; Weicai Ye; He Tong; Jingwen He; Yu Qiao; Hongsheng Li

arXiv:2405.05945·cs.CV·June 14, 2024·1 cites

TL;DR

Lumina-T2X introduces a unified, flow-based diffusion transformer framework capable of generating high-quality images, videos, 3D objects, and audio from text at arbitrary resolutions and durations, with scalable models and efficient training.

Contribution

The paper presents Lumina-T2X, a unified framework that uses flow-based diffusion transformers to generate multiple modalities at varying resolutions, with scalable model sizes and techniques for stabilizing training.

Findings

01. Supports multimodal generation at any resolution and duration.

02. Achieves high scalability with models up to 7 billion parameters.

03. Reduces training costs significantly compared to naive models.

Abstract

Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical report, we introduce the Lumina-T2X family - a series of Flow-based Large Diffusion Transformers (Flag-DiT) equipped with zero-initialized attention, as a unified framework designed to transform noise into images, videos, multi-view 3D objects, and audio clips conditioned on text instructions. By tokenizing the latent spatial-temporal space and incorporating learnable placeholders such as [nextline] and [nextframe] tokens, Lumina-T2X seamlessly unifies the representations of different modalities across various spatial-temporal resolutions. This unified approach enables training within a single framework for different modalities and allows for flexible…
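
The placeholder idea in the abstract can be illustrated with a minimal sketch. It flattens a latent grid of shape (frames, height, width) into one sequence and marks row and frame boundaries with [nextline] and [nextframe]. The function name `flatten_latent`, the string stand-ins for latent tokens, and the exact placement of the placeholders (one after each row, one after each frame) are assumptions made for illustration; they are not taken from the authors' implementation, where the placeholders are learnable embeddings rather than strings.

```python
from typing import List

# Hypothetical string stand-ins for the learnable placeholder tokens.
NEXTLINE = "[nextline]"
NEXTFRAME = "[nextframe]"


def flatten_latent(frames: int, height: int, width: int) -> List[str]:
    """Flatten a (frames, height, width) latent grid into a 1-D token list.

    Assumed layout: a [nextline] marker closes every row and a
    [nextframe] marker closes every frame, so the 2-D/3-D structure
    can be recovered from the flat sequence at any resolution.
    """
    sequence: List[str] = []
    for t in range(frames):
        for y in range(height):
            for x in range(width):
                # Placeholder name for the latent token at position (t, y, x).
                sequence.append(f"z[{t},{y},{x}]")
            sequence.append(NEXTLINE)
        sequence.append(NEXTFRAME)
    return sequence


if __name__ == "__main__":
    # A single 2x2 "image" latent: 4 latent tokens, 2 row markers, 1 frame marker.
    print(flatten_latent(frames=1, height=2, width=2))
```

Under these assumptions, an image is simply the one-frame case, and a longer or larger input only changes the sequence length, which is what lets a single transformer handle different resolutions and durations.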

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Computational Physics and Python Applications

Methods: Attention Is All You Need · Dropout · Label Smoothing · Residual Connection · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Linear Layer · Byte Pair Encoding · Adam