Hellinger Multimodal Variational Autoencoders

Huyen Vo; Isabel Valera

arXiv:2601.06572·cs.LG·April 1, 2026

TL;DR

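This paper introduces HELVAE, a multimodal VAE that aggregates unimodal inference distributions with Hellinger pooling. The authors report more expressive latent representations and a better trade-off between generative coherence and quality than existing multimodal VAEs.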

Contribution

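The authors propose an inference method for multimodal VAEs based on Hellinger pooling, a moment-matching approximation to Hölder pooling with $\alpha = 0.5$. It avoids sub-sampling and is presented as more expressive and efficient than product-of-experts (PoE) and mixture-of-experts (MoE) aggregation.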

Findings

01. HELVAE achieves better trade-offs between generative coherence and quality.

02. The model learns more expressive latent representations as additional modalities are observed.

03. It outperforms state-of-the-art multimodal VAE models.

Abstract

Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with $\alpha = 0.5$, which corresponds to the unique symmetric member of the $\alpha$-divergence family, and derive a moment-matching approximation, termed Hellinger. We then leverage such an approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii)…
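To make the pooling rules contrasted in the abstract concrete, here is a minimal numerical sketch. It is not the authors' implementation, and the paper's moment-matching approximation may be derived differently. It assumes one-dimensional Gaussian experts, evaluates each pool on a grid, and matches the first two moments by numerical integration. The function names (`gaussian_pdf`, `pool_on_grid`, `moment_match`) are hypothetical. The "hellinger" branch uses the standard Hölder (power-mean) pool with exponent 0.5, $q(z) \propto (\sum_i w_i \sqrt{p_i(z)})^2$; the "poe" branch uses a weighted geometric mean as a PoE-style log-linear pool, and the "moe" branch a weighted arithmetic mean.

```python
import numpy as np


def gaussian_pdf(z, mean, std):
    """Density of a 1-D Gaussian evaluated on the grid z."""
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def pool_on_grid(z, densities, weights, kind):
    """Pool expert densities on a uniform grid and renormalize.

    kind = "moe":       weighted arithmetic mean (linear pool)
    kind = "poe":       weighted geometric mean (log-linear, PoE-style pool)
    kind = "hellinger": Hoelder pool with exponent 0.5
    """
    dz = z[1] - z[0]
    p = np.asarray(densities, dtype=float)
    w = np.asarray(weights, dtype=float)[:, None]
    if kind == "moe":
        q = (w * p).sum(axis=0)
    elif kind == "poe":
        # Small constant guards against log(0) in the far tails.
        q = np.exp((w * np.log(p + 1e-300)).sum(axis=0))
    elif kind == "hellinger":
        q = (w * np.sqrt(p)).sum(axis=0) ** 2
    else:
        raise ValueError(f"unknown pooling kind: {kind}")
    return q / (q.sum() * dz)  # Riemann-sum normalization


def moment_match(z, q):
    """Mean and variance of a grid density, i.e. the matched Gaussian."""
    dz = z[1] - z[0]
    mean = (z * q).sum() * dz
    var = ((z - mean) ** 2 * q).sum() * dz
    return mean, var


# Two illustrative unimodal posteriors (assumed values, not from the paper).
z = np.linspace(-12.0, 12.0, 4801)
experts = [gaussian_pdf(z, -1.0, 1.0), gaussian_pdf(z, 1.0, 1.0)]
weights = [0.5, 0.5]

for kind in ("poe", "hellinger", "moe"):
    mean, var = moment_match(z, pool_on_grid(z, experts, weights, kind))
    print(f"{kind:>9}: mean={mean:+.3f} var={var:.3f}")
```

In this toy setting the moment-matched Hellinger pool has a variance between that of the geometric pool and that of the mixture, which fits the framing of $\alpha = 0.5$ as an intermediate between PoE-style and MoE-style aggregation.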
