MotionRAG-Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation
Mingyang Huang, Peng Zhang, Bang Zhang

TL;DR
MotionRAG-Diff is a novel hybrid framework that combines retrieval-augmented generation and diffusion models to produce long-term, coherent, and musically synchronized dance sequences conditioned on arbitrary music inputs.
Contribution
It introduces a cross-modal contrastive learning architecture, an optimized motion graph retrieval system, and a multi-condition diffusion model for improved long-term music-to-dance generation.
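As a rough illustration of what cross-modal contrastive alignment can look like, here is a minimal NumPy sketch of a symmetric InfoNCE loss over paired music and dance embeddings. This summary does not give the paper's actual loss, encoders, or temperature, so every name here (`symmetric_info_nce`, `retrieve_top_k`, `temperature=0.07`) is an assumption made for illustration. The retrieval helper is a plain cosine nearest-neighbour lookup, not the paper's motion graph retrieval.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def symmetric_info_nce(music_emb, dance_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    Row i of music_emb and row i of dance_emb are treated as a positive
    pair; every other row in the batch serves as a negative.
    """
    m = l2_normalize(music_emb)
    d = l2_normalize(dance_emb)
    logits = m @ d.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(len(m))
    loss_m2d = -log_softmax(logits)[idx, idx].mean()    # music -> dance
    loss_d2m = -log_softmax(logits.T)[idx, idx].mean()  # dance -> music
    return 0.5 * (loss_m2d + loss_d2m)


def retrieve_top_k(music_query, dance_bank, k=3):
    """Return indices of the k dance embeddings closest to one music embedding."""
    q = l2_normalize(music_query[None, :])
    bank = l2_normalize(dance_bank)
    scores = (bank @ q.T).ravel()             # cosine similarity per bank entry
    return np.argsort(-scores)[:k]
```

Once both modalities share a latent space, a music clip's embedding can be used directly as a query against a bank of dance-clip embeddings, which is the property a retrieval stage relies on.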
Findings
Achieves state-of-the-art motion quality and diversity.
Demonstrates superior music-motion synchronization accuracy.
Enables long-term, coherent dance sequence generation.
Abstract
Generating long-term, coherent, and realistic music-conditioned dance sequences remains a challenging task in human motion synthesis. Existing approaches exhibit critical limitations: motion graph methods rely on fixed template libraries, restricting creative generation; diffusion models, while capable of producing novel motions, often lack temporal coherence and musical alignment. To address these challenges, we propose MotionRAG-Diff, a hybrid framework that integrates Retrieval-Augmented Generation (RAG) with diffusion-based refinement to enable high-quality, musically coherent dance generation for arbitrary long-term music inputs. Our method introduces three core innovations: (1) A cross-modal contrastive learning architecture that aligns heterogeneous music and dance representations in a shared latent space, establishing unsupervised semantic correspondence without…
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Human Motion and Animation
Methods: Diffusion · Contrastive Learning
