Learning Multimodal Energy-Based Model with Multimodal Variational Auto-Encoder via MCMC Revision

Jiali Cui; Zhiqiang Lao; Heather Yu

arXiv:2605.00644·cs.LG·May 4, 2026


TL;DR

This paper introduces a novel learning framework for multimodal energy-based models that combines MCMC refinement with variational auto-encoders, improving the quality and coherence of multimodal data synthesis.

Contribution

It proposes an integrated training approach that interweaves MLE updates with MCMC refinements for multimodal EBMs and VAEs, enhancing sampling effectiveness.

Findings

01. Achieves superior multimodal synthesis quality and coherence.

02. Demonstrates effective scalability and robustness through extensive experiments.

03. Provides ablation studies validating the proposed framework's components.

Abstract

Energy-based models (EBMs) are a flexible class of deep generative models and are well-suited to capturing complex dependencies in multimodal data. However, learning a multimodal EBM by maximum likelihood requires Markov chain Monte Carlo (MCMC) sampling in the joint data space, where noise-initialized Langevin dynamics often mixes poorly and fails to discover coherent inter-modal relationships. Multimodal VAEs have made progress in capturing such inter-modal dependencies by introducing a shared latent generator and a joint inference model. However, both the shared latent generator and the joint inference model are parameterized as unimodal Gaussian (or Laplace) distributions, which severely limits their ability to approximate the complex structure induced by multimodal data. In this work, we study the learning problem of the multimodal EBM, shared latent generator, and joint inference model. We present a…
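As a rough illustration of the noise-initialized Langevin dynamics the abstract refers to, the sketch below runs the standard unadjusted Langevin update, x <- x - (s/2) * grad E(x) + sqrt(s) * noise, on a toy quadratic energy. This is not the paper's model or training procedure: the energy E(x) = 0.5 * ||x - mu||^2, the names `energy_grad` and `langevin_sample`, and the step size and step count are all assumptions chosen so the example is self-contained. A learned multimodal EBM would replace `energy_grad` with the gradient of a neural energy function over the joint data space.

```python
import numpy as np

def energy_grad(x, mu):
    # Gradient of a toy quadratic energy E(x) = 0.5 * ||x - mu||^2.
    # Hypothetical stand-in for the gradient of a learned EBM.
    return x - mu

def langevin_sample(x0, mu, step_size=0.1, n_steps=200, rng=None):
    # Unadjusted Langevin dynamics: a gradient step on the energy
    # plus Gaussian noise scaled by sqrt(step_size).
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * step_size * energy_grad(x, mu) + np.sqrt(step_size) * noise
    return x

# Noise-initialized chains: start from standard normal noise and run
# all chains in parallel as rows of one array.
rng = np.random.default_rng(1)
mu = np.array([2.0, -1.0])
x0 = rng.standard_normal((2000, 2))
samples = langevin_sample(x0, mu, rng=rng)
```

On this unimodal toy energy the chains converge quickly, so the sample mean lands near `mu`. The difficulty the abstract describes arises when the energy landscape has many separated modes, where chains started from noise can stay trapped near one mode for a long time.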
