MoMu-Diffusion: On Learning Long-Term Motion-Music Synchronization and Correspondence
Fuming You, Minghui Fang, Li Tang, Rongjie Huang, Yongqi Wang, Zhou Zhao

TL;DR
MoMu-Diffusion introduces a unified framework for long-term, synchronized motion-music generation using a novel auto-encoder and diffusion model, enabling diverse cross-modal and variable-length synthesis with improved realism.
Contribution
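The paper presents a Bidirectional Contrastive Rhythmic Variational Auto-Encoder (BiCoR-VAE) that learns modality-aligned latent representations for motion and music, reducing the computational cost of long sequences, and a multi-modal diffusion model that builds on the aligned latent spaces for synchronized motion-music generation.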
Findings
Outperforms state-of-the-art methods in quality and diversity
Capable of long-term, beat-matched motion and music synthesis
Effective in cross-modal and multi-modal generation tasks
Abstract
Motion-to-music and music-to-motion have been studied separately, each attracting substantial research interest within their respective domains. The interaction between human motion and music is a reflection of advanced human intelligence, and establishing a unified relationship between them is particularly important. However, to date, there has been no work that considers them jointly to explore the modality alignment within. To bridge this gap, we propose a novel framework, termed MoMu-Diffusion, for long-term and synchronous motion-music generation. Firstly, to mitigate the huge computational costs raised by long sequences, we propose a novel Bidirectional Contrastive Rhythmic Variational Auto-Encoder (BiCoR-VAE) that extracts the modality-aligned latent representations for both motion and music inputs. Subsequently, leveraging the aligned latent spaces, we introduce a multi-modal…
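The idea of pulling paired motion and music latents together in both directions can be illustrated with a generic symmetric InfoNCE-style loss. This is a minimal sketch under stated assumptions, not the paper's BiCoR-VAE objective: the function name `symmetric_contrastive_loss`, the `temperature` value, and the use of whole-clip latent vectors are all hypothetical choices made for illustration.

```python
import math

def _normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def _log_softmax_at(scores, target):
    # Numerically stable log-softmax evaluated at one target index.
    peak = max(scores)
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return scores[target] - log_sum

def symmetric_contrastive_loss(motion_z, music_z, temperature=0.1):
    # Hypothetical helper: motion_z[i] and music_z[i] are assumed to be the
    # latent vectors of the i-th paired motion/music clip.
    m = [_normalize(v) for v in motion_z]
    a = [_normalize(v) for v in music_z]
    n = len(m)
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [[sum(x * y for x, y in zip(m[i], a[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Motion-to-music direction: each motion clip should pick its own music.
    m2a = -sum(_log_softmax_at(logits[i], i) for i in range(n)) / n
    # Music-to-motion direction: same matrix, read column by column.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    a2m = -sum(_log_softmax_at(columns[j], j) for j in range(n)) / n
    return 0.5 * (m2a + a2m)

# Toy latents: three clips with mutually orthogonal embeddings.
latents = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = symmetric_contrastive_loss(latents, latents)
shuffled = symmetric_contrastive_loss(latents, latents[1:] + latents[:1])
```

In this toy setup the correctly paired batch yields a much smaller loss than the batch whose music latents have been rotated by one position, which is the behaviour a bidirectional alignment objective is meant to reward.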
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
