Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yi Xin, Xinyue Li, Qi Qin, Yu Qiao, Hongsheng Li, Peng Gao

TL;DR
Lumina-mGPT is a multimodal autoregressive model that generates photorealistic images from text with high efficiency and versatility, advancing unified vision-language modeling.
Contribution
It introduces a multimodal autoregressive framework with a flexible image representation and dedicated fine-tuning strategies, achieving competitive results on image generation and other multimodal tasks.
Findings
Achieves image generation performance comparable to diffusion models.
Supports high-quality images with varying aspect ratios.
Demonstrates versatile multimodal capabilities including generation and recognition tasks.
Abstract
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. By initializing from multimodal Generative PreTraining (mGPT), we demonstrate that a decoder-only autoregressive (AR) model can achieve image generation performance comparable to modern diffusion models with high efficiency through Flexible Progressive Supervised Fine-tuning (FP-SFT). Equipped with our proposed Unambiguous image Representation (UniRep), Lumina-mGPT can flexibly generate high-quality images of varying aspect ratios. Building on the strong image generation capabilities, we further explore Ominiponent Supervised Fine-tuning (Omni-SFT), an initial attempt to elevate Lumina-mGPT into a unified multi-modal generalist. The resulting model demonstrates versatile multimodal…
Code & Models
- 🤗 Alpha-VLLM/Lumina-mGPT-7B-768-Omni (model) · 11 downloads · ♡ 9
- 🤗 Alpha-VLLM/Lumina-mGPT-7B-768 (model) · 1.1k downloads · ♡ 38
- 🤗 Alpha-VLLM/Lumina-mGPT-7B-512-MultiImage (model) · 92 downloads · ♡ 5
- 🤗 Alpha-VLLM/Lumina-mGPT-7B-1024 (model) · 19 downloads · ♡ 10
- 🤗 Alpha-VLLM/Lumina-mGPT-7B-512 (model) · 211 downloads · ♡ 4
- 🤗 Alpha-VLLM/Lumina-mGPT-34B-512 (model) · 5 downloads · ♡ 3
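
For readers who want the weights locally, the following is a minimal sketch of fetching one of the repositories listed above with the `huggingface_hub` package. The repository names are taken from this list; the helper names `repo_id` and `fetch_checkpoint` are hypothetical and introduced only for illustration. The sketch only downloads files. It does not load or run the model, which would presumably require the authors' own inference code.

```python
from typing import Optional

# Checkpoint names copied from the list above; the organization prefix is shared.
ORG = "Alpha-VLLM"
CHECKPOINTS = [
    "Lumina-mGPT-7B-768-Omni",
    "Lumina-mGPT-7B-768",
    "Lumina-mGPT-7B-512-MultiImage",
    "Lumina-mGPT-7B-1024",
    "Lumina-mGPT-7B-512",
    "Lumina-mGPT-34B-512",
]


def repo_id(name: str) -> str:
    """Build the Hub repository id for a listed checkpoint (hypothetical helper)."""
    if name not in CHECKPOINTS:
        raise ValueError(f"unknown checkpoint: {name!r}")
    return f"{ORG}/{name}"


def fetch_checkpoint(name: str, local_dir: Optional[str] = None) -> str:
    """Download every file of the repository and return the local folder path.

    Hypothetical helper. The import is deferred so that the lookup code above
    works even when huggingface_hub is not installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id(name), local_dir=local_dir)


# Example call (downloads many gigabytes, so it is left commented out):
# path = fetch_checkpoint("Lumina-mGPT-7B-768")
```

Restricting the helper to the listed names avoids accidentally requesting a mistyped repository, and the 34B checkpoint is far larger than the 7B ones, so checking free disk space before downloading is advisable.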
Taxonomy
Topics: Human Motion and Animation · Multimodal Machine Learning Applications · Digital Storytelling and Education
Methods: Diffusion
