Diffusion Models for Joint Audio-Video Generation

Alejandro Paredes La Torre

arXiv:2603.16093 · cs.SD · March 18, 2026

Open Access

TL;DR

This paper advances joint audio-video generation by releasing new datasets, training a diffusion model from scratch, analyzing multimodal decoding challenges, and proposing a sequential generation pipeline for high-fidelity results.

Contribution

It introduces new paired audio-video datasets, trains a diffusion architecture from scratch, investigates multimodal decoding issues, and proposes a sequential generation pipeline.

Findings

01 High-quality paired datasets released for research.

02 Diffusion model produces semantically coherent audio-video pairs.

03 Sequential generation pipeline achieves high-fidelity audio-video synthesis.

Abstract

Multimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I present four key contributions to advance this field. First, I release two high-quality, paired audio-video datasets, consisting of 13 hours of video-game clips and 64 hours of concert performances, each segmented into consistent 34-second samples to facilitate reproducible research. Second, I train the MM-Diffusion architecture from scratch on these datasets, demonstrating its ability to produce semantically coherent audio-video pairs and quantitatively evaluating alignment on rapid actions and musical cues. Third, I investigate joint latent diffusion by leveraging pretrained video and audio encoder-decoders, uncovering challenges and inconsistencies in the multimodal decoding stage.…
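To make the 34-second segmentation concrete, the following is a minimal sketch of how fixed-length clip boundaries could be computed. It is not the author's preprocessing code: the function name `segment_bounds` is hypothetical, and the sketch assumes non-overlapping clips, a dropped trailing remainder, and the stated 13-hour and 64-hour totals treated as single continuous recordings. Real datasets built from many source videos would yield different counts.

```python
def segment_bounds(total_seconds: int, clip_seconds: int = 34):
    """Return (start, end) pairs in seconds for non-overlapping, full-length clips.

    Assumption: any remainder shorter than clip_seconds is discarded.
    """
    n_full = total_seconds // clip_seconds
    return [(i * clip_seconds, (i + 1) * clip_seconds) for i in range(n_full)]


# Totals taken from the abstract; treating each as one continuous
# recording is an illustrative assumption.
game_clips = segment_bounds(13 * 3600)
concert_clips = segment_bounds(64 * 3600)

print(len(game_clips), len(concert_clips))
```

Under these assumptions the upper bound is roughly 1,376 video-game samples and 6,776 concert samples; the same boundaries would be applied to the audio and video streams so that each pair stays aligned.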

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Speech and Audio Processing