TL;DR
This paper introduces LDDBM, a latent diffusion framework for modality translation that works across diverse sensory domains without the restrictive assumptions common in prior approaches (shared dimensionality, Gaussian source priors, modality-specific architectures).
Contribution
The work presents a latent-variable extension of Denoising Diffusion Bridge Models with contrastive and predictive losses, enabling flexible, domain-agnostic modality translation across multiple tasks.
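To make the role of a contrastive loss in a shared latent space concrete, here is a minimal sketch, not the paper's implementation. It assumes an InfoNCE-style objective, which this summary does not confirm; the linear encoders, dimensions, and function names (`linear`, `info_nce`, `rand_matrix`) are hypothetical and chosen only for illustration. It shows two inputs of different dimensionality being projected to a common latent size and scored so that paired samples are pulled together.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def linear(x, W):
    """Apply a (d_in x d_out) matrix W, stored as a list of rows, to vector x."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*W)]


def info_nce(z_src, z_tgt, temperature=0.1):
    """One-directional InfoNCE: row i of z_src should match row i of z_tgt.

    Assumption: the paper's actual contrastive loss may differ from this form.
    """
    z_src = [normalize(v) for v in z_src]
    z_tgt = [normalize(v) for v in z_tgt]
    total = 0.0
    for i, s in enumerate(z_src):
        logits = [sum(a * b for a, b in zip(s, t)) / temperature for t in z_tgt]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[i]  # negative log-probability of the true pair
    return total / len(z_src)


def rand_matrix(rows, cols):
    """Random Gaussian matrix as a list of rows (stand-in for data or weights)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
# Hypothetical encoders: source is 6-d, target is 10-d, shared latent is 4-d.
W_src = rand_matrix(6, 4)
W_tgt = rand_matrix(10, 4)
x_src = rand_matrix(5, 6)   # batch of 5 source samples
x_tgt = rand_matrix(5, 10)  # batch of 5 paired target samples
z_src = [linear(x, W_src) for x in x_src]
z_tgt = [linear(x, W_tgt) for x in x_tgt]
loss = info_nce(z_src, z_tgt)
```

Because both encoders map into the same latent size, the loss is well defined even though the raw inputs have different shapes, which is the property a shared latent space is meant to provide. In a real system the encoders would be learned networks and the loss would be minimized jointly with the other training objectives.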
Findings
Supports arbitrary modality pairs
Achieves strong results on diverse tasks
Establishes new baseline in modality translation
Abstract
Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary…