Multimodal Variational Autoencoder: a Barycentric View
Peijie Qiu, Wenhui Zhu, Sayantan Kumar, Xiwen Chen, Xiaotong Sun, Jin Yang, Abolfazl Razi, Yalin Wang, Aristeidis Sotiras

TL;DR
This paper introduces a barycentric framework for multimodal variational autoencoders, in which the choice of divergence measure (for example, the Wasserstein distance) determines how unimodal distributions are aggregated, with the aim of improving the learning of shared and modality-specific representations across multiple data modalities.
Contribution
It provides a novel theoretical formulation of multimodal VAEs using barycenters, extending existing methods with flexible divergence choices, notably the Wasserstein barycenter.
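To make the barycentric view concrete, the following is a sketch in assumed notation (the symbols q_i, w_i, and D are illustrative and may differ from the paper's). Given unimodal inference distributions q_1, ..., q_M, nonnegative weights w_i summing to one, and a divergence D, a barycenter is the distribution closest to all of them on average. Standard results for the two directions of the KL divergence recover a weighted geometric mean (a PoE-style aggregate) and a mixture (an MoE-style aggregate), while the squared 2-Wasserstein distance gives the Wasserstein barycenter.

```latex
% Generic barycenter of the unimodal distributions q_1, ..., q_M
q^{\star} = \arg\min_{q} \sum_{i=1}^{M} w_i \, D(q, q_i),
\qquad w_i \ge 0, \quad \sum_{i=1}^{M} w_i = 1 .

% Three choices of D and the resulting aggregate
\begin{aligned}
D(q, q_i) = \mathrm{KL}(q \,\|\, q_i)
  &\;\Longrightarrow\; q^{\star}(z) \propto \prod_{i=1}^{M} q_i(z)^{w_i}
  && \text{(weighted geometric mean, PoE-style)} \\
D(q, q_i) = \mathrm{KL}(q_i \,\|\, q)
  &\;\Longrightarrow\; q^{\star}(z) = \sum_{i=1}^{M} w_i \, q_i(z)
  && \text{(mixture, MoE-style)} \\
D(q, q_i) = W_2^2(q, q_i)
  &\;\Longrightarrow\; q^{\star} \text{ is the 2-Wasserstein barycenter.}
\end{aligned}
```

Note that the usual PoE multiplies the experts without exponents; the weighted geometric mean above is the normalized-weight variant that arises from the barycenter objective.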
Findings
Wasserstein barycenter better preserves distribution geometry.
The proposed method outperforms traditional PoE and MoE approaches.
Empirical results on three benchmarks validate effectiveness.
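As a rough illustration of the first finding above, the sketch below compares two ways of aggregating diagonal Gaussian posteriors: a weighted geometric mean (a PoE-style aggregate) and the 2-Wasserstein barycenter. The function names, the diagonal-Gaussian assumption, and the uniform default weights are illustrative choices, not the authors' implementation. For diagonal Gaussians both aggregates have closed forms: the geometric mean averages precisions, while the Wasserstein barycenter averages means and standard deviations.

```python
import numpy as np


def _prepare(mus, sigmas, weights):
    """Convert inputs to arrays of shape (M, D) and weights of shape (M, 1)."""
    mus = np.asarray(mus, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    if weights is None:
        # Assumption: uniform weights over the M unimodal posteriors.
        weights = np.full(mus.shape[0], 1.0 / mus.shape[0])
    w = np.asarray(weights, dtype=float)[:, None]
    return mus, sigmas, w


def geometric_mean_gaussian(mus, sigmas, weights=None):
    """Weighted geometric mean of diagonal Gaussians (PoE-style aggregate).

    The result is Gaussian with precision equal to the weighted sum of the
    expert precisions and a precision-weighted mean.
    """
    mus, sigmas, w = _prepare(mus, sigmas, weights)
    precisions = w / sigmas**2
    var = 1.0 / precisions.sum(axis=0)
    mu = var * (precisions * mus).sum(axis=0)
    return mu, np.sqrt(var)


def wasserstein_barycenter_gaussian(mus, sigmas, weights=None):
    """2-Wasserstein barycenter of diagonal Gaussians.

    For diagonal (commuting) covariances the barycenter is Gaussian with
    the weighted average of the means and of the standard deviations.
    """
    mus, sigmas, w = _prepare(mus, sigmas, weights)
    return (w * mus).sum(axis=0), (w * sigmas).sum(axis=0)
```

For two one-dimensional posteriors N(0, 1²) and N(2, 3²) with equal weights, the Wasserstein barycenter is N(1, 2²), which interpolates both location and spread, whereas the geometric mean has mean 0.2 and standard deviation of about 1.34, pulled toward the more confident expert.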
Abstract
Multiple signal modalities, such as vision and sound, are naturally present in real-world phenomena. Recently, there has been growing interest in learning generative models, in particular variational autoencoders (VAEs), for multimodal representation learning, especially in the case of missing modalities. The primary goal of these models is to learn modality-invariant and modality-specific representations that characterize information across multiple modalities. Previous attempts at multimodal VAEs approach this mainly through the lens of experts, aggregating unimodal inference distributions with a product of experts (PoE), a mixture of experts (MoE), or a combination of both. In this paper, we provide an alternative generic and theoretical formulation of multimodal VAEs through the lens of barycenters. We first show that PoE and MoE are specific instances of barycenters, derived by…
Taxonomy
Topics: Neural Networks and Applications
Methods: Mixture of Experts
