MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation
Mingzhen Sun, Weining Wang, Yanyuan Qiao, Jiahui Sun, Zihan Qin, Longteng Guo, Xinxin Zhu, Jing Liu

TL;DR
This paper introduces MM-LDM, a multi-modal latent diffusion model that unifies audio and video representations for high-quality, efficient sounding video generation, demonstrating state-of-the-art results and broad adaptability.
Contribution
The paper presents a novel hierarchical multi-modal autoencoder that constructs shared semantic and perceptual latent spaces for audio and video, improving generation quality and efficiency.
Findings
Achieves state-of-the-art results on multiple datasets.
Significantly improves training and sampling speed.
Demonstrates strong adaptability across various generation tasks.
Abstract
Sounding Video Generation (SVG) is an audio-video joint generation task challenged by high-dimensional signal spaces, distinct data formats, and different patterns of content information. To address these issues, we introduce a novel multi-modal latent diffusion model (MM-LDM) for the SVG task. We first unify the representation of audio and video data by converting them into a single or a couple of images. Then, we introduce a hierarchical multi-modal autoencoder that constructs a low-level perceptual latent space for each modality and a shared high-level semantic feature space. The former space is perceptually equivalent to the raw signal space of each modality but drastically reduces signal dimensions. The latter space serves to bridge the information gap between modalities and provides more insightful cross-modal guidance. Our proposed method achieves new state-of-the-art results…
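The abstract's first step, representing audio as an image, can be illustrated with a minimal sketch. The abstract does not say which transform MM-LDM uses, so everything here is an assumption made for illustration: the helper name `audio_to_image`, the Hann-windowed short-time Fourier transform, the `n_fft` and `hop` values, and the log scaling to [0, 1]. The video side is not sketched because the source gives no detail on how frames become "a couple of images".

```python
import numpy as np

def audio_to_image(waveform, n_fft=256, hop=128):
    """Hypothetical helper: map a 1-D waveform to a 2-D log-magnitude
    spectrogram scaled to [0, 1], so it can be treated like a grayscale image.
    This is an illustrative assumption, not the paper's stated transform."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames: (n_frames, n_fft).
    frames = np.stack([
        waveform[i * hop : i * hop + n_fft] * window
        for i in range(n_frames)
    ])
    # Magnitude spectrum per frame, transposed to (frequency, time).
    spec = np.abs(np.fft.rfft(frames, axis=1)).T
    # Compress dynamic range, then min-max normalize to [0, 1].
    log_spec = np.log1p(spec)
    lo, hi = log_spec.min(), log_spec.max()
    return (log_spec - lo) / (hi - lo + 1e-8)

# Usage: one second of a 440 Hz tone sampled at 16 kHz (assumed values).
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)
img = audio_to_image(wave)
print(img.shape)  # (129, 124): 129 frequency bins by 124 time frames
```

Once audio is a 2-D array like this, an image-style autoencoder can in principle compress it in the same way as video frames, which is the kind of unification the abstract describes.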
Peer Reviews
Decision: Submitted to ICLR 2024
1. The application of diffusion in latent space has been demonstrated in the fields of image and video generation, thereby warranting its extension to multi-modal generation. This approach to model generation is highly relevant in the current context of machine learning and artificial intelligence, where multi-modal data is increasingly prevalent. By leveraging the power of diffusion in latent space, multi-modal models can be developed that can generate diverse outputs across various modalities.
1. I think the authors' understanding of latent diffusion is not deep enough, and the writing contains many unprofessional and unscientific descriptions. For details, see Questions 1, 2 and 4. 2. The notation in equation (7) is confusing. According to the paper, (n_a^t, n_v^t) are predicted noise features, but this variable is evidently not a predicted value; it is a variable drawn from the N(0,1) distribution. 3. The Implementation Details only give the training details of the multi-
1. The idea of modeling audio and video in latent space for sounding video generation is interesting and promising. 2. The writing is good and the results further demonstrate the effectiveness of the proposed method.
1. Ideas similar to those in the conditional generation section have been proposed in many papers, so this seems too weak to list as a technical contribution of the paper. I would like the authors to present this point as a "bonus" of the proposed model. 2. The visual quality of the MM-Diffusion results in Fig. 4 seems quite different from that in the original paper, even considering that the results have been super-resolved by the SR model. Is there any explanation for that? The visual quality of the results seems
1. The authors attempt to solve a novel and valuable problem and design a reasonable framework for this purpose. 2. The multimodal VAE designed by the author is interesting, establishing semantic latent spaces for audio and video modalities. Further, the authors use a shared multimodal decoder introduced in cross-modal alignment, which can inspire future multimodal generation. 3. The experimental results are promising in metrics, demonstrating the effectiveness of the proposed method. In particu
1. The design of a multimodal VAE is innovative, but it may not be as effective as two separate VAEs. The authors should compare their multimodal VAE with straightforward separate audio and video VAEs, which would better demonstrate its effectiveness. 2. I have viewed the generated results provided by the authors, and only some results from the AIST++ dataset are available (MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation, anonymouss765.github.io). However,
Taxonomy
Topics: Music and Audio Processing
Methods: Latent Diffusion Model · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion
