Enhancing Unimodal Latent Representations in Multimodal VAEs through Iterative Amortized Inference

Yuta Oshima; Masahiro Suzuki; Yutaka Matsuo

arXiv:2410.11403 · cs.LG · October 16, 2024

TL;DR

This paper introduces an iterative inference method for multimodal VAEs that improves unimodal latent representations by refining inference with all available modalities, reducing information loss and amortization gaps.

Contribution

We propose multimodal iterative amortized inference, which enhances unimodal inference accuracy by iterative refinement, addressing limitations of existing mixture and alignment-based models.

Findings

01 Improved linear classification accuracy on benchmark datasets.

02 Lower FID scores indicating better cross-modal generation.

03 Enhanced unimodal latent representations with all modalities available.

Abstract

Multimodal variational autoencoders (VAEs) aim to capture shared latent representations by integrating information from different data modalities. A significant challenge is accurately inferring representations from any subset of modalities without training an impractical number (2^M) of inference networks, one for each possible combination of M modalities. Mixture-based models simplify this by requiring only as many inference models as there are modalities and aggregating the unimodal inferences. However, they suffer from information loss when modalities are missing. Alignment-based VAEs address this by aligning unimodal inference models with a multimodal model by minimizing the Kullback-Leibler (KL) divergence, but they are affected by amortization gaps, which compromise inference accuracy. To tackle these problems, we introduce multimodal iterative amortized inference, an iterative refinement…
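The KL-based alignment mentioned in the abstract can be illustrated with a minimal sketch. It assumes diagonal Gaussian posteriors, which is a common VAE choice but is not stated in the abstract, and the direction of the divergence (multimodal posterior first, unimodal posterior second) is likewise an assumption. The function name `gaussian_kl` and all parameter values are hypothetical and are not taken from the paper.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ).

    Inputs are means and log-variances; the result is summed over the last
    (latent) axis.
    """
    mu_q = np.asarray(mu_q, dtype=float)
    logvar_q = np.asarray(logvar_q, dtype=float)
    mu_p = np.asarray(mu_p, dtype=float)
    logvar_p = np.asarray(logvar_p, dtype=float)
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    per_dim = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return per_dim.sum(axis=-1)

# Hypothetical posterior parameters for a 2-D latent space (illustration only).
mu_multi, logvar_multi = np.array([0.0, 0.0]), np.array([0.0, 0.0])  # multimodal posterior
mu_uni, logvar_uni = np.array([1.0, 0.0]), np.array([0.0, 0.0])      # unimodal posterior

# Assumed alignment term: divergence from the multimodal to the unimodal posterior.
alignment_loss = gaussian_kl(mu_multi, logvar_multi, mu_uni, logvar_uni)
print(alignment_loss)
```

In an alignment-based model, a term of this kind would be minimized with respect to the unimodal inference network's parameters. The sketch only evaluates the divergence for fixed parameters and does not implement the iterative refinement the paper proposes.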

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems