Multimodal Variational Autoencoders for Semi-Supervised Learning: In Defense of Product-of-Experts
Svetlana Kutuzova, Oswin Krause, Douglas McCloskey, Mads Nielsen, Christian Igel

TL;DR
This paper introduces a product-of-experts variational autoencoder for multimodal semi-supervised learning, demonstrating its advantages over other methods in generating and sampling multiple modalities coherently.
Contribution
It proposes a novel PoE-based VAE that handles semi-supervised multimodal learning and, in the reported experiments, can outperform a mixture-of-experts approach and an approach that combines modalities with an additional encoder network.
Findings
PoE-based models can outperform the MoE and encoder-network baselines in the empirical evaluation.
PoE models better support joint generation of multiple modalities.
The experiments support the intuition that PoE models are better suited to a conjunctive combination of modalities.
Abstract
Multimodal generative models should be able to learn a meaningful latent representation that enables a coherent joint generation of all modalities (e.g., images and text). Many applications also require the ability to accurately sample modalities conditioned on observations of a subset of the modalities. Often not all modalities are observed for all training data points, so semi-supervised learning should be possible. In this study, we propose a novel product-of-experts (PoE) based variational autoencoder that has these desired properties. We benchmark it against a mixture-of-experts (MoE) approach and an approach that combines the modalities with an additional encoder network. An empirical evaluation shows that the PoE-based models can outperform the contrasted models. Our experiments support the intuition that PoE models are more suited for a conjunctive combination of modalities.
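To make the PoE versus MoE contrast in the abstract concrete, here is a minimal sketch for one latent dimension, assuming each modality's encoder outputs a Gaussian expert. It is not the paper's model: the function names and the numeric means and variances are hypothetical, chosen only for illustration. A product of Gaussians multiplies the densities, so precisions add and the result is sharper than any single expert (a conjunctive "and"). A mixture picks one expert at random, so it is only as sharp as that expert (a disjunctive "or").

```python
import math
import random


def product_of_gaussians(means, variances):
    """Normalized product of 1-D Gaussian experts N(mean_i, var_i).

    Precisions (1 / variance) add up, and the combined mean is the
    precision-weighted average of the expert means.
    """
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    return mean, 1.0 / total_precision


def sample_mixture_of_gaussians(means, variances, rng):
    """One sample from an equally weighted mixture of the same experts."""
    i = rng.randrange(len(means))
    return rng.gauss(means[i], math.sqrt(variances[i]))


# Hypothetical experts for two modalities (illustrative values only).
expert_means = [0.0, 2.0]      # e.g. an image expert and a text expert
expert_variances = [1.0, 1.0]

# Both modalities observed: the product sits between the experts and is sharper.
poe_mean, poe_var = product_of_gaussians(expert_means, expert_variances)

# One modality missing: the product is taken over the observed experts only.
partial_mean, partial_var = product_of_gaussians(expert_means[:1], expert_variances[:1])

# The mixture draws from whichever expert happens to be selected.
rng = random.Random(0)
moe_sample = sample_mixture_of_gaussians(expert_means, expert_variances, rng)
```

With these values the product has mean 1.0 and variance 0.5, while dropping the second expert simply returns the first expert unchanged. That second case is the property that makes a product convenient when some modalities are unobserved for a data point, as in the semi-supervised setting the abstract describes.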
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Music and Audio Processing · Topic Modeling
