On the Limitations of Multimodal VAEs

Imant Daunhawer; Thomas M. Sutter; Kieran Chin-Cheong; Emanuele Palumbo; Julia E. Vogt

arXiv:2110.04121 · cs.LG · April 8, 2022

TL;DR

This paper identifies a fundamental limitation of mixture-based multimodal VAEs: sub-sampling modalities imposes an upper bound on the multimodal ELBO, which limits generative quality and leaves these models behind unimodal VAEs, particularly on more complex datasets.

Contribution

The paper formally proves a key limitation of mixture-based multimodal VAEs and empirically demonstrates its impact on generative quality across synthetic and real data.

Findings

01 Multimodal VAEs show a gap in generative quality compared to unimodal VAEs, on both synthetic and real data.

02 Sub-sampling of modalities imposes an upper bound on the multimodal ELBO.

03 No existing multimodal VAE variant fulfills all desired criteria of an effective multimodal generative model on more complex datasets.

Abstract

Multimodal variational autoencoders (VAEs) have shown promise as efficient generative models for weakly-supervised data. Yet, despite their advantage of weak supervision, they exhibit a gap in generative quality compared to unimodal VAEs, which are completely unsupervised. In an attempt to explain this gap, we uncover a fundamental limitation that applies to a large family of mixture-based multimodal VAEs. We prove that the sub-sampling of modalities enforces an undesirable upper bound on the multimodal ELBO and thereby limits the generative quality of the respective models. Empirically, we showcase the generative quality gap on both synthetic and real data and present the tradeoffs between different variants of multimodal VAEs. We find that none of the existing approaches fulfills all desired criteria of an effective multimodal generative model when applied on more complex datasets…
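To make the sub-sampling argument concrete, here is a minimal Python sketch on a toy discrete distribution. It rests on an assumption that this summary does not state: that the gap between the log-evidence and a mixture-based objective can be measured as a weighted sum of conditional entropies, H(unobserved modalities | observed subset), taken over the modality subsets the model sub-samples. The names `discrepancy`, `joint`, `moe`, and `full` are illustrative and do not come from the paper.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def marginal(joint, idx):
    """Marginalize a joint distribution onto the modality indices in idx."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def conditional_entropy(joint, target, given):
    """H(X_target | X_given) = H(X_target, X_given) - H(X_given)."""
    both = marginal(joint, tuple(given) + tuple(target))
    return entropy(both) - entropy(marginal(joint, tuple(given)))


def discrepancy(joint, mixture):
    """Assumed gap: sum over subsets A of w_A * H(X_rest | X_A).

    mixture maps each sub-sampled subset of modality indices to its weight.
    """
    n = len(next(iter(joint)))
    total = 0.0
    for subset, weight in mixture.items():
        rest = tuple(i for i in range(n) if i not in subset)
        if rest:  # nothing is unobserved when the subset covers all modalities
            total += weight * conditional_entropy(joint, rest, subset)
    return total


# Toy data: two binary modalities that agree 90% of the time.
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

# Mixture-of-experts style: encode from one modality at a time.
moe = {(0,): 0.5, (1,): 0.5}
# No sub-sampling: always encode from both modalities.
full = {(0, 1): 1.0}

print("gap with sub-sampling:   ", discrepancy(joint, moe))
print("gap without sub-sampling:", discrepancy(joint, full))
```

Under this assumption, the gap is zero when all modalities are always observed and grows with the amount of modality-specific information, that is, information in one modality that cannot be predicted from the others.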

Videos

On the Limitations of Multimodal VAEs · SlidesLive

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music and Audio Processing