Diffusion Models For Multi-Modal Generative Modeling
Changyou Chen, Han Ding, Bunyamin Sisman, Yi Xu, Ouye Xie, Benjamin Z. Yao, Son Dinh Tran, Belinda Zeng

TL;DR
This paper introduces a unified multi-modal diffusion model that can generate and handle various types of data simultaneously, advancing the capabilities of diffusion-based generative modeling.
Contribution
The paper proposes a novel multi-modal diffusion framework with a shared backbone and modality-specific decoders, enabling multi-task learning and multi-modal data generation.
Findings
Effective in image transition and masked-image training
Supports joint image-label and image-representation modeling
Shows promising results on ImageNet
Abstract
Diffusion-based generative modeling has been achieving state-of-the-art results on various generation tasks. Most diffusion models, however, are limited to modeling a single generation task. Can we generalize diffusion models with the ability of multi-modal generative training for more generalizable modeling? In this paper, we propose a principled way to define a diffusion model by constructing a unified multi-modal diffusion model in a common diffusion space. We define the forward diffusion process to be driven by an information aggregation from multiple types of task-data, e.g., images for a generation task and labels for a classification task. In the reverse process, we enforce information sharing by parameterizing a shared backbone denoising network with additional modality-specific decoder heads. Such a structure can simultaneously learn to generate different types of multi-modal data…
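The structure described in the abstract (an aggregated forward process, then a shared backbone with modality-specific heads) can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the aggregation is assumed to be a sum of linear encodings, the backbone is a single tanh layer, timestep conditioning is omitted, and every name and dimension (aggregate, forward_diffuse, denoise, IMAGE_DIM, and so on) is hypothetical.

```python
import math
import random


def linear(weights, x):
    """Multiply a row-major weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def rand_matrix(rows, cols, rng):
    """Small random weights; stands in for learned parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def aggregate(image, label_onehot, enc_image, enc_label):
    """ASSUMPTION: aggregate modalities by summing linear encodings
    into one shared diffusion space. The paper may aggregate differently."""
    z_image = linear(enc_image, image)
    z_label = linear(enc_label, label_onehot)
    return [a + b for a, b in zip(z_image, z_label)]


def forward_diffuse(z0, alpha_bar, rng):
    """Standard Gaussian forward step:
    z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * z + noise * rng.gauss(0.0, 1.0) for z in z0]


def denoise(z_t, backbone, head_image, head_label):
    """Shared backbone followed by one decoder head per modality."""
    hidden = [math.tanh(v) for v in linear(backbone, z_t)]
    return linear(head_image, hidden), linear(head_label, hidden)


# Hypothetical toy dimensions.
IMAGE_DIM, NUM_CLASSES, LATENT_DIM, HIDDEN_DIM = 8, 3, 4, 6
rng = random.Random(0)

enc_image = rand_matrix(LATENT_DIM, IMAGE_DIM, rng)
enc_label = rand_matrix(LATENT_DIM, NUM_CLASSES, rng)
backbone = rand_matrix(HIDDEN_DIM, LATENT_DIM, rng)
head_image = rand_matrix(IMAGE_DIM, HIDDEN_DIM, rng)
head_label = rand_matrix(NUM_CLASSES, HIDDEN_DIM, rng)

image = [rng.random() for _ in range(IMAGE_DIM)]
label = [0.0, 1.0, 0.0]  # one-hot class label

z0 = aggregate(image, label, enc_image, enc_label)
z_t = forward_diffuse(z0, alpha_bar=0.5, rng=rng)
image_out, label_logits = denoise(z_t, backbone, head_image, head_label)
```

Both heads read the same hidden vector, so any gradient from either modality's loss would update the shared backbone; that is the information sharing the abstract refers to. A training loop would add a per-modality loss on image_out and label_logits, which is left out here.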
Taxonomy
Topics: Simulation Techniques and Applications · Reinforcement Learning in Robotics
Methods: Diffusion
