MotionRAG-Diff: A Retrieval-Augmented Diffusion Framework for Long-Term Music-to-Dance Generation

Mingyang Huang; Peng Zhang; Bang Zhang

arXiv:2506.02661 · cs.SD · June 4, 2025

TL;DR

MotionRAG-Diff is a novel hybrid framework that combines retrieval-augmented generation and diffusion models to produce long-term, coherent, and musically synchronized dance sequences conditioned on arbitrary music inputs.

Contribution

It introduces a cross-modal contrastive learning architecture, an optimized motion graph retrieval system, and a multi-condition diffusion model for improved long-term music-to-dance generation.

Findings

1. Achieves state-of-the-art motion quality and diversity.
2. Demonstrates superior music-motion synchronization accuracy.
3. Enables long-term, coherent dance sequence generation.

Abstract

Generating long-term, coherent, and realistic music-conditioned dance sequences remains a challenging task in human motion synthesis. Existing approaches exhibit critical limitations: motion graph methods rely on fixed template libraries, restricting creative generation; diffusion models, while capable of producing novel motions, often lack temporal coherence and musical alignment. To address these challenges, we propose MotionRAG-Diff, a hybrid framework that integrates Retrieval-Augmented Generation (RAG) with diffusion-based refinement to enable high-quality, musically coherent dance generation for arbitrary long-term music inputs. Our method introduces three core innovations: (1) A cross-modal contrastive learning architecture that aligns heterogeneous music and dance representations in a shared latent space, establishing unsupervised semantic correspondence without…
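
To make the idea of aligning music and dance in a shared latent space more concrete, here is a minimal sketch of a symmetric InfoNCE-style contrastive loss and a nearest-neighbour retrieval step. It is an illustration only, not the authors' implementation: the abstract does not name the loss, and the function names (`symmetric_info_nce`, `retrieve`), the temperature of 0.07, and the assumption that row i of each batch forms a matched music/dance pair are all hypothetical choices made for this example.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity_matrix(music_embs, dance_embs):
    """Return sim[i][j] = cosine similarity of music clip i and dance clip j."""
    m = [l2_normalize(v) for v in music_embs]
    d = [l2_normalize(v) for v in dance_embs]
    return [[sum(a * b for a, b in zip(mi, dj)) for dj in d] for mi in m]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the target class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        mx = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum = mx + math.log(sum(math.exp(x - mx) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def symmetric_info_nce(music_embs, dance_embs, temperature=0.07):
    """Assumed loss: average of music-to-dance and dance-to-music InfoNCE.

    Assumes music_embs[i] and dance_embs[i] are a matched pair (hypothetical).
    """
    sim = cosine_similarity_matrix(music_embs, dance_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for dance-to-music
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


def retrieve(music_emb, dance_embs):
    """Return the index of the dance embedding most similar to the music query."""
    sims = cosine_similarity_matrix([music_emb], dance_embs)[0]
    return max(range(len(sims)), key=lambda j: sims[j])


if __name__ == "__main__":
    music = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    shuffled = [[0.0, 1.0], [1.0, 0.0]]
    # The loss should be much smaller when matched pairs point the same way.
    print(symmetric_info_nce(music, aligned), symmetric_info_nce(music, shuffled))
```

Once the two encoders are trained with a loss of this kind, retrieval reduces to a similarity search: a music segment is embedded and the closest dance segments in the shared space become candidates for the motion graph. How the paper actually constructs its training pairs and retrieval index is not stated in the text above.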

Taxonomy

Topics: Music and Audio Processing · Music Technology and Sound Studies · Human Motion and Animation

Methods: Diffusion · Contrastive Learning