DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation
Qiaosong Qi, Le Zhuo, Aixi Zhang, Yue Liao, Fei Fang, Si Liu, Shuicheng Yan

TL;DR
DiffDance is a novel cascaded diffusion model that generates realistic, long-form dance sequences aligned with music, overcoming limitations of autoregressive methods by using a two-stage diffusion approach and advanced training techniques.
Contribution
The paper introduces a cascaded diffusion framework for dance generation, combining music-to-dance and super-resolution models with contrastive and geometric losses for improved realism and alignment.
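The contrastive alignment between music and motion embeddings can be illustrated with a symmetric InfoNCE-style loss. This is a minimal sketch under stated assumptions, not the paper's implementation: the summary does not give the exact loss, so the symmetric form, the cosine similarity, the temperature of 0.07, and the name symmetric_info_nce are all hypothetical choices. Embeddings are assumed to be non-zero vectors, with row i of the music batch paired with row i of the motion batch.

```python
import math


def _normalize(vec):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean over rows i of -log softmax(logits[i])[i]."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the row max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_info_nce(music_emb, motion_emb, temperature=0.07):
    """Hypothetical contrastive loss: pair i of music and motion is the positive.

    The temperature value is an assumption, not taken from the paper.
    """
    music = [_normalize(v) for v in music_emb]
    motion = [_normalize(v) for v in motion_emb]
    # Cosine similarity between every music/motion pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(m, d)) / temperature for d in motion]
        for m in music
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the music-to-motion and motion-to-music directions.
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(transposed))


# Toy batch: two music embeddings and two candidate motion batches.
music = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_info_nce(music, [[0.9, 0.1], [0.1, 0.9]])
swapped = symmetric_info_nce(music, [[0.1, 0.9], [0.9, 0.1]])
```

In this toy batch the correctly paired embeddings give a lower loss than the swapped ones, which is the behavior a contrastive objective relies on to pull matching music and motion together in the shared space.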
Findings
Produces high-quality, long-form dance sequences
Achieves results comparable to state-of-the-art autoregressive methods
Demonstrates effective music-motion alignment on AIST++ dataset
Abstract
When hearing music, it is natural for people to dance to its rhythm. Automatic dance generation, however, is a challenging task due to the physical constraints of human motion and rhythmic alignment with target music. Conventional autoregressive methods introduce compounding errors during sampling and struggle to capture the long-term structure of dance sequences. To address these limitations, we present a novel cascaded motion diffusion model, DiffDance, designed for high-resolution, long-form dance generation. This model comprises a music-to-dance diffusion model and a sequence super-resolution diffusion model. To bridge the gap between music and motion for conditional generation, DiffDance employs a pretrained audio representation learning model to extract music embeddings and further align its embedding space to motion via contrastive loss. During training our cascaded diffusion…
Taxonomy
Methods: Diffusion · ALIGN
