CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling

Ruihan Yang; Hannes Gamper; Sebastian Braun

arXiv:2312.05412·cs.LG·October 10, 2024·1 cites

Open Access

TL;DR

This paper presents CMMD, a multi-modal diffusion model for synchronized video and audio generation, utilizing contrastive training and a novel fusion architecture to enhance quality, speed, and alignment.

Contribution

The paper introduces a bi-directional conditional diffusion model with a contrastive loss and a new fusion block for improved video-audio generation and synchronization.

Findings

01. Outperforms the baseline in quality and generation speed

02. Improves audio-visual alignment with the contrastive loss, particularly for video-to-audio generation

03. Evaluated on two datasets

Abstract

We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio. We propose a joint contrastive training loss to improve the synchronization between visual and auditory occurrences. We present experiments on two datasets to evaluate the efficacy of our proposed model. The assessment of generation quality and alignment performance is carried out from various angles, encompassing both objective and subjective metrics. Our findings demonstrate that the proposed model outperforms the baseline in terms of quality and generation speed through introduction of our novel cross-modal easy fusion architectural block. Furthermore, the incorporation of the contrastive loss results in improvements in audio-visual alignment, particularly in the high-correlation video-to-audio generation task.
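
The abstract does not spell out the form of the joint contrastive training loss, so the following is only a generic illustration of the underlying idea: a matched video-audio pair should score higher than mismatched pairs from the same batch. The sketch uses a standard InfoNCE-style objective over toy embeddings. The function names (`cosine`, `info_nce`), the `temperature` value, and the embeddings themselves are assumptions made for illustration, not the paper's formulation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(video_emb, audio_emb, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    Hypothetical helper: entry i of video_emb and entry i of audio_emb
    are treated as the matched (positive) pair; every other audio entry
    in the batch serves as a negative for video i.
    """
    n = len(video_emb)
    total = 0.0
    for i in range(n):
        logits = [cosine(video_emb[i], audio_emb[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matched audio clip.
        total += log_denom - logits[i]
    return total / n

# Toy embeddings (assumed values): two videos and their audio tracks.
video = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [aligned[1], aligned[0]]  # audio tracks deliberately mismatched

loss_aligned = info_nce(video, aligned)
loss_swapped = info_nce(video, swapped)
```

With these toy inputs the aligned batch yields a lower loss than the swapped batch, which is the behavior a contrastive term is meant to reward during training.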

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion