One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu

TL;DR
UniDiffuser introduces a unified transformer-based diffusion framework capable of modeling and generating multiple modalities and their combinations simultaneously, achieving high-quality results across diverse multi-modal tasks.
Contribution
The paper presents a novel unified diffusion model that handles marginal, conditional, and joint distributions across modalities with minimal modifications to existing diffusion models.
Findings
Reports FID and CLIP scores superior to those of existing general-purpose multi-modal models.
Capable of generating high-quality images, text, and their combinations.
Performs comparably to bespoke models such as Stable Diffusion and DALL-E 2 on representative tasks such as text-to-image generation.
Abstract
This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is -- learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model -- perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data,…
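The unified objective described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the linear noise schedule, the step count T = 1000, and the names `perturb`, `toy_noise_predictor`, and `unified_loss` are all assumptions made for the example, and the zero-output predictor is only a hypothetical stand-in for the transformer.

```python
import numpy as np

T = 1000                                  # assumed number of diffusion steps
betas = np.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)       # cumulative signal retention per step

def perturb(x0, t, eps):
    """Forward-diffuse x0 to step t in 1..T; t == 0 returns the clean data."""
    if t == 0:
        return x0
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

def toy_noise_predictor(x_t, y_t, t_x, t_y):
    """Hypothetical stand-in for the joint network: one noise estimate per modality."""
    return np.zeros_like(x_t), np.zeros_like(y_t)

def unified_loss(x0, y0, rng, predictor=toy_noise_predictor):
    """One training-loss evaluation with an independent timestep per modality."""
    t_x = int(rng.integers(1, T + 1))     # timestep for modality x (e.g. image)
    t_y = int(rng.integers(1, T + 1))     # timestep for modality y (e.g. text)
    eps_x = rng.standard_normal(x0.shape)
    eps_y = rng.standard_normal(y0.shape)
    x_t = perturb(x0, t_x, eps_x)         # perturb both modalities, not just one
    y_t = perturb(y0, t_y, eps_y)
    pred_x, pred_y = predictor(x_t, y_t, t_x, t_y)
    # Squared error on the noise of all modalities together.
    return float(np.sum((pred_x - eps_x) ** 2) + np.sum((pred_y - eps_y) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))          # toy "image" features
y0 = rng.standard_normal((4, 3))          # toy "text" features
loss = unified_loss(x0, y0, rng)
```

Under this view, the choice of timesteps at sampling time selects the distribution: holding t_y at 0 keeps y clean and so conditions on it, holding t_y at T replaces y with near-pure noise and so yields the marginal over x, and tying t_x = t_y denoises both modalities jointly.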
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Mycobacterium research and diagnosis · Advanced Neuroimaging Techniques and Applications
Methods: Diffusion · Contrastive Language-Image Pre-training
