Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

Dongyang Liu; Shitian Zhao; Le Zhuo; Weifeng Lin; Yi Xin; Xinyue Li; Qi Qin; Yu Qiao; Hongsheng Li; Peng Gao

arXiv:2408.02657 · cs.CV · April 25, 2025 · 2 cites

Open Access · 2 Repos · 6 Models

TL;DR

Lumina-mGPT is a multimodal autoregressive model that generates photorealistic images from text with high efficiency and versatility, advancing unified vision-language modeling.

Contribution

It introduces a multimodal autoregressive framework with a flexible image representation and fine-tuning strategies, achieving competitive results in image generation and other multimodal tasks.

Findings

01 Achieves image generation performance comparable to diffusion models.

02 Supports high-quality images with varying aspect ratios.

03 Demonstrates versatile multimodal capabilities including generation and recognition tasks.

Abstract

We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling at generating flexible photorealistic images from text descriptions. By initializing from multimodal Generative PreTraining (mGPT), we demonstrate that a decoder-only autoregressive (AR) model can achieve image generation performance comparable to modern diffusion models with high efficiency through Flexible Progressive Supervised Fine-tuning (FP-SFT). Equipped with our proposed Unambiguous image Representation (UniRep), Lumina-mGPT can flexibly generate high-quality images of varying aspect ratios. Building on these strong image generation capabilities, we further explore Omnipotent Supervised Fine-tuning (Omni-SFT), an initial attempt to elevate Lumina-mGPT into a unified multimodal generalist. The resulting model demonstrates versatile multimodal…
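The abstract does not spell out how UniRep works, so the following is only a minimal sketch of the general idea that a flat token sequence needs shape markers to be unambiguous about aspect ratio. It assumes that height and width indicators are placed at the start of the image and that an end-of-line marker follows each row of image tokens. The marker strings (`<boi>`, `<h=…>`, `<w=…>`, `<eol>`, `<eoi>`) and the functions `serialize` and `deserialize` are hypothetical names chosen for illustration, not the paper's vocabulary or code.

```python
from typing import List, Union

# A token is either an image-codebook id (int) or a special marker (str).
# All marker strings here are hypothetical, for illustration only.
Token = Union[int, str]


def serialize(grid: List[List[int]]) -> List[Token]:
    """Flatten a 2D grid of image-token ids into a 1D sequence with shape markers."""
    if not grid or not grid[0]:
        raise ValueError("grid must contain at least one non-empty row")
    height, width = len(grid), len(grid[0])
    # Assumed layout: state the grid shape right after the start-of-image marker.
    seq: List[Token] = ["<boi>", f"<h={height}>", f"<w={width}>"]
    for row in grid:
        if len(row) != width:
            raise ValueError("all rows must have the same length")
        seq.extend(row)
        seq.append("<eol>")  # marks the end of one row of image tokens
    seq.append("<eoi>")
    return seq


def deserialize(seq: List[Token]) -> List[List[int]]:
    """Recover the 2D grid by splitting the sequence at end-of-line markers."""
    rows: List[List[int]] = []
    current: List[int] = []
    for tok in seq:
        if tok == "<eol>":
            rows.append(current)
            current = []
        elif isinstance(tok, int):
            current.append(tok)
    return rows


# The same six image tokens arranged as 2x3 (wide) and 3x2 (tall).
wide = [[1, 2, 3], [4, 5, 6]]
tall = [[1, 2], [3, 4], [5, 6]]
wide_seq = serialize(wide)
tall_seq = serialize(tall)
```

Without the markers, both grids would flatten to the same six ids and a decoder could not tell a wide image from a tall one. With them, `wide_seq` and `tall_seq` differ and each round-trips through `deserialize` to its original shape.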

Taxonomy

Topics: Human Motion and Animation · Multimodal Machine Learning Applications · Digital Storytelling and Education

Methods: Diffusion