Muse: Text-To-Image Generation via Masked Generative Transformers
Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, Dilip Krishnan

TL;DR
Muse is an efficient text-to-image Transformer that leverages masked modeling in discrete token space, achieving state-of-the-art performance and enabling versatile image editing without fine-tuning.
Contribution
Introduces Muse, a masked generative Transformer for text-to-image synthesis and editing that reaches state-of-the-art image quality while being significantly more efficient than diffusion and autoregressive models.
Findings
Achieved a state-of-the-art FID of 6.06 on CC3M with the 900M-parameter model.
Attained an FID of 7.88 and a CLIP score of 0.32 on zero-shot COCO with the 3B-parameter model.
Enabled multiple image editing applications without fine-tuning.
Abstract
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships,…
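The two ideas named in the abstract, training on randomly masked image tokens and decoding many tokens in parallel, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: `MASK_ID`, `mask_tokens`, `parallel_decode`, and `toy_predict` are hypothetical names, the cosine unmasking schedule and confidence-based commit rule are one common choice for masked generative decoding, and `toy_predict` is a stand-in for the text-conditioned Transformer.

```python
import math
import random

MASK_ID = -1  # hypothetical id reserved for the [MASK] token


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of image tokens with MASK_ID.

    Returns the corrupted sequence and the masked positions; a training
    loss would be computed only at those positions.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for p in positions:
        corrupted[p] = MASK_ID
    return corrupted, positions


def parallel_decode(predict, seq_len, num_steps):
    """Fill a fully masked sequence in at most num_steps model calls.

    predict(tokens) must return one (token_id, confidence) pair per position.
    """
    tokens = [MASK_ID] * seq_len
    for step in range(1, num_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK_ID]
        if not masked:
            break
        # Cosine schedule (an assumption): how many tokens stay masked
        # after this step. It reaches 0 on the final step.
        keep_masked = math.floor(seq_len * math.cos(math.pi / 2 * step / num_steps))
        n_commit = max(1, len(masked) - keep_masked)
        guesses = predict(tokens)
        # Commit the most confident predictions among the masked positions.
        best = sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:n_commit]
        for i in best:
            tokens[i] = guesses[i][0]
    return tokens


def toy_predict(tokens):
    """Stand-in model: fixed token ids with pseudo-random confidences."""
    rng = random.Random(0)
    return [(i % 7, rng.random()) for i in range(len(tokens))]


corrupted, positions = mask_tokens(list(range(10)), 0.5, random.Random(0))
print(corrupted, positions)
print(parallel_decode(toy_predict, seq_len=16, num_steps=4))
```

In this sketch a 16-token sequence is filled with 4 calls to the predictor, whereas decoding one token at a time would need 16 calls. That difference is the source of the efficiency gain over autoregressive decoding described in the abstract.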
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Adam · Byte Pair Encoding · Residual Connection · Label Smoothing · Contrastive Language-Image Pre-training
