PF-D2M: A Pose-free Diffusion Model for Universal Dance-to-Music Generation

Jaekwon Im; Natalia Polouliakh; Taketo Akama

arXiv:2601.15872 · cs.SD · January 23, 2026

Open Access

TL;DR

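PF-D2M is a universal diffusion-based model that generates music aligned with dance videos using visual features extracted from the video, rather than body motion features from a single human dancer. A progressive training strategy addresses data scarcity and generalization, and both objective and subjective evaluations show state-of-the-art dance-music alignment and music quality.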

Contribution

Introduces PF-D2M, a diffusion-based dance-to-music generation model that incorporates visual features extracted from dance videos and uses a progressive training strategy to address data scarcity and improve generalization.

Findings

1. Achieves state-of-the-art dance-music alignment.

2. Effective in multi-dancer and non-human dancer scenarios.

3. Outperforms existing methods in music quality.

Abstract

Dance-to-music generation aims to generate music that is aligned with dance movements. Existing approaches typically rely on body motion features extracted from a single human dancer and limited dance-to-music datasets, which restrict their performance and applicability to real-world scenarios involving multiple dancers and non-human dancers. In this paper, we propose PF-D2M, a universal diffusion-based dance-to-music generation model that incorporates visual features extracted from dance videos. PF-D2M is trained with a progressive training strategy that effectively addresses data scarcity and generalization challenges. Both objective and subjective evaluations show that PF-D2M achieves state-of-the-art performance in dance-music alignment and music quality.

Taxonomy

Topics: Human Motion and Animation · Music Technology and Sound Studies · Generative Adversarial Networks and Image Synthesis