TL;DR
Muddit is a unified discrete diffusion transformer that enables fast, parallel multimodal generation across text and images by integrating pretrained visual priors with a lightweight text decoder.
Contribution
The paper introduces Muddit, the second generation of Meissonic: a unified discrete diffusion transformer that combines strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder for efficient multimodal generation.
Findings
Muddit performs competitively with, or better than, larger autoregressive models.
It enables fast, parallel generation across text and image modalities.
The model demonstrates the effectiveness of purely discrete diffusion with strong priors.
Abstract
Unified generation models aim to handle diverse tasks across modalities -- such as text generation, image generation, and vision-language reasoning -- within a single architecture and decoding paradigm. Autoregressive unified models suffer from slow inference due to sequential decoding, and non-autoregressive unified models suffer from weak generalization due to limited pretrained backbones. We introduce the second-generation Meissonic: Muddit, a unified discrete diffusion transformer that enables fast and parallel generation across both text and image modalities. Unlike prior unified diffusion models trained from scratch, Muddit integrates strong visual priors from a pretrained text-to-image backbone with a lightweight text decoder, enabling flexible and high-quality multimodal generation under a unified architecture. Empirical results show that Muddit achieves competitive or superior…
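To make the contrast between sequential decoding and parallel generation concrete, here is a minimal sketch of mask-and-predict decoding, the general scheme used by masked discrete diffusion models. It is not Muddit's actual sampler: `toy_predictor` is a hypothetical stand-in for the denoising transformer that returns random tokens and confidences, and the cosine unmasking schedule, the `MASK` id, and all function names are assumptions made for illustration.

```python
import math
import random

MASK = -1  # hypothetical id standing in for a [MASK] token


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for a denoising transformer (assumption, not Muddit's model).

    Returns one (token, confidence) guess for every position in a single call.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def parallel_unmask(seq_len, num_steps, vocab_size=16, seed=0):
    """Fill a fully masked sequence in at most num_steps predictor calls."""
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    passes = 0  # number of predictor calls actually made
    for step in range(1, num_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_predictor(tokens, vocab_size, rng)
        passes += 1
        # Assumed cosine schedule: how many positions stay masked after this step.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * step / num_steps))
        num_to_fill = max(1, len(masked) - keep_masked)
        # Commit the most confident guesses among the still-masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:num_to_fill]:
            tokens[i] = preds[i][0]
    return tokens, passes


tokens, passes = parallel_unmask(seq_len=64, num_steps=8)
print(f"filled {len(tokens)} positions in {passes} predictor calls")
```

In this toy setup, a 64-token sequence is completed in 8 predictor calls, whereas token-by-token autoregressive decoding would need one call per token. The step count is a free parameter here, and the sketch says nothing about the output quality a real model would reach at a given number of steps.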
