Muse: Text-To-Image Generation via Masked Generative Transformers

Huiwen Chang; Han Zhang; Jarred Barber; AJ Maschinot; Jose Lezama; Lu Jiang; Ming-Hsuan Yang; Kevin Murphy; William T. Freeman; Michael Rubinstein; Yuanzhen Li; Dilip Krishnan

arXiv:2301.00704·cs.CV·January 3, 2023·119 cites

TL;DR

Muse is an efficient text-to-image Transformer that leverages masked modeling in discrete token space, achieving state-of-the-art performance and enabling versatile image editing without fine-tuning.

Contribution

The paper introduces Muse, a masked generative Transformer that pairs high-quality image synthesis and editing with greater efficiency than diffusion or autoregressive models.

Findings

01. The 900M-parameter model achieves a state-of-the-art FID of 6.06 on CC3M.

02. The 3B-parameter model attains an FID of 7.88 and a CLIP score of 0.32 on zero-shot COCO evaluation.

03. The model supports several image editing applications without fine-tuning.

Abstract

We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships,…
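The two mechanisms the abstract names, masked-token training and parallel decoding, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the `MASK` sentinel, the fixed mask ratio, the cosine unmasking schedule, and `toy_predict`, a random stand-in for the text-conditioned Transformer. Only the overall shape follows the abstract: corrupt image tokens at random during training, then at sampling time predict all masked positions at once and commit the most confident ones over a small number of steps.

```python
import math
import random

MASK = -1  # hypothetical sentinel id for a masked image token


def mask_tokens(tokens, mask_ratio, rng):
    """Training-side corruption: replace a random subset of tokens with MASK."""
    n_mask = max(1, round(mask_ratio * len(tokens)))
    masked_positions = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in masked_positions else t for i, t in enumerate(tokens)]


def toy_predict(tokens, vocab_size, rng):
    """Stand-in for the model: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_decode(seq_len, vocab_size, steps, rng):
    """Start fully masked and commit the most confident guesses at each step."""
    tokens = [MASK] * seq_len
    for step in range(1, steps + 1):
        # All positions are predicted in one pass (the "parallel" part).
        preds = toy_predict(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many tokens stay masked after this step.
        keep_masked = math.floor(seq_len * math.cos(math.pi / 2 * step / steps))
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        n_unmask = max(0, len(masked) - keep_masked)
        # Commit the highest-confidence predictions among still-masked positions.
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:n_unmask]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    print(mask_tokens(list(range(8)), 0.5, rng))
    print(parallel_decode(seq_len=16, vocab_size=1024, steps=8, rng=rng))
```

In this sketch a 16-token sequence is filled in 8 model passes rather than 16, which is the source of the efficiency gain over token-by-token autoregressive decoding; with a real model the gap widens as the sequence length grows while the step count stays fixed.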

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Adam · Byte Pair Encoding · Residual Connection · Label Smoothing · Contrastive Language-Image Pre-training