Multimodal ELBO with Diffusion Decoders
Daniel Wesego, Pedram Rooshenas

TL;DR
This paper introduces a multimodal VAE with a diffusion decoder and auxiliary score model, significantly improving the quality and coherence of generated multimodal data, especially for complex modalities like images.
Contribution
It proposes a novel multimodal VAE variant integrating a diffusion generative model and an auxiliary score-based model, enhancing output quality and coherence.
Findings
Achieves state-of-the-art results on multiple datasets.
Generates high-quality, coherent multimodal outputs.
Outperforms existing multimodal VAEs in quality and diversity.
Abstract
Multimodal variational autoencoders have demonstrated their ability to learn the relationships between different modalities by mapping them into a latent representation. Their design and capacity to perform any-to-any conditional and unconditional generation make them appealing. However, different variants of multimodal VAEs often suffer from generating low-quality output, particularly when complex modalities such as images are involved. In addition to that, they frequently exhibit low coherence among the generated modalities when sampling from the joint distribution. To address these limitations, we propose a new variant of the multimodal VAE ELBO that incorporates a better decoder using a diffusion generative model. The diffusion decoder enables the model to learn complex modalities and generate high-quality outputs. The multimodal model can also seamlessly integrate with a standard…
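For orientation, the generic objective that multimodal VAEs build on can be written down, together with one common way a latent-conditioned diffusion decoder is trained. This is a background sketch only, not the specific ELBO variant proposed in the paper, whose exact form is not given in the abstract above. The notation is assumed here for illustration: M modalities x_1, ..., x_M, a shared latent z, an encoder q_phi, decoders p_theta, a noise-prediction network epsilon_theta, and a noise schedule with cumulative products \bar{\alpha}_t.

```latex
% Generic multimodal ELBO over M modalities with a shared latent z
% (standard form; the paper's variant may differ).
\mathcal{L}_{\mathrm{ELBO}}(x_{1:M})
  = \mathbb{E}_{q_\phi(z \mid x_{1:M})}
      \Big[ \sum_{m=1}^{M} \log p_\theta(x_m \mid z) \Big]
  - \mathrm{KL}\big( q_\phi(z \mid x_{1:M}) \,\|\, p(z) \big)

% Assumed illustration: for a complex modality x_m (e.g. images), the
% reconstruction term can be handled by a diffusion decoder conditioned on z,
% trained with the usual DDPM noise-prediction loss.
x_{m,t} = \sqrt{\bar{\alpha}_t}\, x_m + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L}_{\mathrm{diff}}(x_m, z)
  = \mathbb{E}_{t,\, \epsilon}
      \Big[ \big\| \epsilon - \epsilon_\theta(x_{m,t},\, t,\, z) \big\|^2 \Big]
```

Under these assumptions, the denoising loss stands in for the negative log-likelihood term of the image modality, while simpler modalities can keep ordinary feed-forward decoders.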
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Semantic Web and Ontologies
Methods: Diffusion
