CoVAE: correlated multimodal generative modeling

Federico Caretti; Guido Sanguinetti

arXiv:2603.01965 · cs.LG · March 3, 2026

Open Access

TL;DR

CoVAE is a generative model that captures the correlations between modalities in multimodal data. On real and synthetic data sets it delivers accurate cross-modal reconstruction and effective quantification of the associated uncertainties.

Contribution

This work presents CoVAE, a new architecture that preserves joint statistical structure in multimodal data, addressing limitations of previous latent space fusion strategies.

Findings

01 Accurate cross-modal reconstruction demonstrated.

02 Effective uncertainty quantification achieved.

03 Works on both real and synthetic datasets.

Abstract

Multimodal Variational Autoencoders have emerged as a popular tool to extract effective representations from rich multimodal data. However, such models rely on fusion strategies in latent space that destroy the joint statistical structure of the multimodal data, with profound implications for generation and uncertainty quantification. In this work, we introduce Correlated Variational Autoencoders (CoVAE), a new generative architecture that captures the correlations between modalities. We test CoVAE on a number of real and synthetic data sets, demonstrating both accurate cross-modal reconstruction and effective quantification of the associated uncertainties.
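
The link between cross-modal correlation and uncertainty can be illustrated with a toy calculation. The sketch below is not the CoVAE architecture and does not reproduce any fusion strategy from the paper; it only assumes two scalar variables that are jointly Gaussian, standing in for one quantity per modality, and applies the standard conditioning formula for a bivariate Gaussian. The function name `conditional_gaussian` and all numeric values are hypothetical, chosen for illustration.

```python
import math


def conditional_gaussian(mu1, mu2, s1, s2, rho, x1):
    """Mean and standard deviation of x2 given x1 for a bivariate Gaussian.

    mu1, mu2: marginal means; s1, s2: marginal standard deviations;
    rho: correlation between the two variables; x1: observed value.
    """
    mean = mu2 + rho * (s2 / s1) * (x1 - mu1)
    std = s2 * math.sqrt(1.0 - rho ** 2)
    return mean, std


# Hypothetical setting: both variables are standard normal, and x1 = 2.0 is observed.
# With correlation, the observation shifts the prediction and narrows its spread.
corr_mean, corr_std = conditional_gaussian(0.0, 0.0, 1.0, 1.0, 0.8, 2.0)

# Without correlation, the observation carries no information about x2:
# the prediction falls back to the marginal mean and marginal spread.
ind_mean, ind_std = conditional_gaussian(0.0, 0.0, 1.0, 1.0, 0.0, 2.0)

print(f"correlated:  mean={corr_mean:.2f}, std={corr_std:.2f}")
print(f"independent: mean={ind_mean:.2f}, std={ind_std:.2f}")
```

In this toy setting, discarding the correlation changes both the predicted value and its reported uncertainty, which is the kind of effect the abstract attributes to losing the joint statistical structure.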

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Tensor decomposition and applications · Face recognition and analysis