aMUSEd: An Open MUSE Reproduction
Suraj Patil, William Berman, Robin Rombach, Patrick von Platen

TL;DR
aMUSEd is a lightweight, open-source masked image model for fast text-to-image generation. It uses 10 percent of MUSE's parameters, needs fewer inference steps than latent diffusion while being more interpretable, and can learn a new style from a single image.
Contribution
The paper presents aMUSEd, a compact masked image model (MIM) built for fast text-to-image generation, and releases reproducible training code together with checkpoints for two models that generate images directly at 256x256 and 512x512 resolution.
Findings
aMUSEd achieves fast image generation with 10% of MUSE's parameters.
It requires fewer inference steps than latent diffusion models.
The model can learn new styles from a single image.
Abstract
We present aMUSEd, an open-source, lightweight masked image model (MIM) for text-to-image generation based on MUSE. With 10 percent of MUSE's parameters, aMUSEd is focused on fast image generation. We believe MIM is under-explored compared to latent diffusion, the prevailing approach for text-to-image generation. Compared to latent diffusion, MIM requires fewer inference steps and is more interpretable. Additionally, MIM can be fine-tuned to learn additional styles with only a single image. We hope to encourage further exploration of MIM by demonstrating its effectiveness on large-scale text-to-image generation and releasing reproducible training code. We also release checkpoints for two models which directly produce images at 256x256 and 512x512 resolutions.
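To illustrate why masked image modeling can need few inference steps, here is a minimal sketch of a parallel decoding schedule. It assumes a cosine masking schedule of the kind commonly used in masked-token generators; the abstract does not state which schedule aMUSEd uses. The function name, the 256-token grid, and the 12 steps are hypothetical values chosen for illustration. The sketch only counts how many tokens are fixed per step and does not model which tokens a real model would select.

```python
import math


def cosine_unmask_schedule(num_tokens: int, num_steps: int) -> list[int]:
    """Return how many image tokens are newly fixed at each decoding step.

    Assumption: after step t of T, a fraction cos(pi/2 * t/T) of the
    tokens is still masked. This is an illustrative schedule, not a
    confirmed detail of aMUSEd.
    """
    revealed_per_step = []
    still_masked = num_tokens
    for step in range(1, num_steps + 1):
        frac_masked = math.cos(math.pi / 2 * step / num_steps)
        target_masked = math.floor(num_tokens * frac_masked)
        if step == num_steps:
            # The final step always fixes every remaining token.
            target_masked = 0
        target_masked = min(target_masked, still_masked)
        revealed_per_step.append(still_masked - target_masked)
        still_masked = target_masked
    return revealed_per_step


# Hypothetical setup: a 16x16 grid of latent tokens decoded in 12 steps.
schedule = cosine_unmask_schedule(num_tokens=256, num_steps=12)
print(schedule)
print(sum(schedule))  # every token is fixed exactly once
```

Because many tokens are predicted in parallel at every step, the whole grid is filled in after a fixed, small number of forward passes. Early steps commit only a few tokens and later steps commit many.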
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
Methods: Mutual Information Machine/Mask Image Modeling
