Diffusion Models for Joint Audio-Video Generation
Alejandro Paredes La Torre

TL;DR
This paper advances joint audio-video generation by releasing new datasets, training a diffusion model from scratch, analyzing multimodal decoding challenges, and proposing a sequential generation pipeline for high-fidelity results.
Contribution
It introduces new paired audio-video datasets, trains a diffusion architecture from scratch, investigates multimodal decoding issues, and proposes a sequential generation pipeline.
Findings
High-quality paired datasets released for research.
Diffusion model produces semantically coherent audio-video pairs.
Sequential generation pipeline achieves high-fidelity audio-video synthesis.
Abstract
Multimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I present four key contributions to advance this field. First, I release two high-quality, paired audio-video datasets. The datasets consist of 13 hours of video-game clips and 64 hours of concert performances, each segmented into consistent 34-second samples to facilitate reproducible research. Second, I train the MM-Diffusion architecture from scratch on these datasets, demonstrating its ability to produce semantically coherent audio-video pairs and quantitatively evaluating alignment on rapid actions and musical cues. Third, I investigate joint latent diffusion by leveraging pretrained video and audio encoder-decoders, uncovering challenges and inconsistencies in the multimodal decoding stage.…
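The fixed-length segmentation mentioned in the abstract can be illustrated with a minimal sketch. This is not the author's preprocessing code. The function name `segment_bounds` and the choice to drop a trailing partial window are assumptions made here for illustration. Only the 34-second sample length comes from the abstract.

```python
def segment_bounds(duration_s: float, segment_s: float = 34.0) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds for consecutive full-length windows.

    Hypothetical helper: the 34-second default comes from the abstract;
    discarding the trailing partial window is an assumption of this sketch.
    """
    if segment_s <= 0:
        raise ValueError("segment_s must be positive")
    # Count only windows that fit entirely inside the clip.
    n_full = int(duration_s // segment_s)
    return [(i * segment_s, (i + 1) * segment_s) for i in range(n_full)]


if __name__ == "__main__":
    # A 100-second clip yields two full 34-second windows.
    for start, end in segment_bounds(100.0):
        print(f"{start:.1f}s to {end:.1f}s")
```

Cutting the video and audio tracks with the same boundaries keeps each pair time-aligned, which a paired dataset needs. How the released datasets treat leftover material at the end of a clip is not stated in the text above.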
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Speech and Audio Processing
