CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling
Ruihan Yang, Hannes Gamper, Sebastian Braun

TL;DR
This paper presents CMMD, a multi-modal diffusion model for synchronized video and audio generation, utilizing contrastive training and a novel fusion architecture to enhance quality, speed, and alignment.
Contribution
The paper introduces a bi-directional conditional diffusion model with a contrastive loss and a new fusion block for improved video-audio generation and synchronization.
Findings
Outperforms baseline in quality and speed
Improves audio-visual alignment with contrastive loss
Effective on multiple datasets
Abstract
We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio. We propose a joint contrastive training loss to improve the synchronization between visual and auditory occurrences. We present experiments on two datasets to evaluate the efficacy of our proposed model. The assessment of generation quality and alignment performance is carried out from various angles, encompassing both objective and subjective metrics. Our findings demonstrate that the proposed model outperforms the baseline in terms of quality and generation speed through introduction of our novel cross-modal easy fusion architectural block. Furthermore, the incorporation of the contrastive loss results in improvements in audio-visual alignment, particularly in the high-correlation video-to-audio generation task.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMusic and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
MethodsSPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion
