Multimodal ELBO with Diffusion Decoders

Daniel Wesego; Pedram Rooshenas

arXiv:2408.16883·cs.LG·February 4, 2025

TL;DR

This paper introduces a multimodal VAE with a diffusion decoder and auxiliary score model, significantly improving the quality and coherence of generated multimodal data, especially for complex modalities like images.

Contribution

It proposes a novel multimodal VAE variant integrating a diffusion generative model and an auxiliary score-based model, enhancing output quality and coherence.

Findings

01. Achieves state-of-the-art results on multiple datasets.

02. Generates high-quality, coherent multimodal outputs.

03. Outperforms existing multimodal VAEs in quality and diversity.

Abstract

Multimodal variational autoencoders have demonstrated their ability to learn the relationships between different modalities by mapping them into a latent representation. Their design and capacity to perform any-to-any conditional and unconditional generation make them appealing. However, different variants of multimodal VAEs often suffer from generating low-quality output, particularly when complex modalities such as images are involved. In addition to that, they frequently exhibit low coherence among the generated modalities when sampling from the joint distribution. To address these limitations, we propose a new variant of the multimodal VAE ELBO that incorporates a better decoder using a diffusion generative model. The diffusion decoder enables the model to learn complex modalities and generate high-quality outputs. The multimodal model can also seamlessly integrate with a standard…
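
For orientation, the kind of objective the abstract refers to can be sketched in its generic form. The first expression below is the standard multimodal VAE ELBO for M modalities, assuming the modalities are conditionally independent given a shared latent z. It is not the paper's exact variant, which this truncated abstract does not state. The symbols x_1, ..., x_M, q_phi, p_theta, and the noise-prediction network epsilon_theta are notation assumed here for illustration. The second expression shows one common way a diffusion decoder could stand in for a modality's reconstruction term, using the usual denoising surrogate conditioned on z; whether the authors use this particular parameterization is an assumption.

```latex
% Generic multimodal ELBO (assumes modalities are conditionally independent given z)
\log p_\theta(x_{1:M}) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x_{1:M})}\!\left[ \sum_{m=1}^{M} \log p_\theta(x_m \mid z) \right]
  - \mathrm{KL}\!\left( q_\phi(z \mid x_{1:M}) \,\|\, p(z) \right)

% Assumed denoising surrogate for a diffusion-decoded modality x_m.
% x_{m,t} is x_m noised to step t; \epsilon_\theta is a hypothetical noise predictor.
% It stands in for -\log p_\theta(x_m \mid z) up to weighting and additive constants.
\mathcal{L}^{\mathrm{diff}}_m(z) \;=\;
  \mathbb{E}_{t,\; \epsilon \sim \mathcal{N}(0, I)}
  \left[ \left\lVert \epsilon - \epsilon_\theta(x_{m,t},\, t,\, z) \right\rVert^2 \right]
```

In this sketch the latent z conditions the denoiser, so the encoder still receives a training signal through the reconstruction term while the KL term regularizes the posterior toward the prior p(z).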

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Semantic Web and Ontologies

Methods: Diffusion