Every Image Listens, Every Image Dances: Music-Driven Image Animation
Zhikang Dong, Weituo Hao, Ju-Chiang Wang, Peng Zhang, Pawel Polak

TL;DR
MuseDance is a novel end-to-end model that animates images using music and text, enabling personalized, synchronized dance videos without complex motion guidance, and introduces a new multimodal dance dataset.
Contribution
The paper presents MuseDance, a new diffusion-based model for music- and text-driven image animation, and provides a multimodal dance dataset for research.
Findings
MuseDance achieves synchronized and personalized dance animations.
The model generalizes well across diverse images and music.
The dataset supports future research in multimodal dance video generation.
Abstract
Image animation has become a promising area in multimodal research, with a focus on generating videos from reference images. While prior work has largely emphasized generic video generation guided by text, music-driven dance video generation remains underexplored. In this paper, we introduce MuseDance, an innovative end-to-end model that animates reference images using both music and text inputs. This dual input enables MuseDance to generate personalized videos that follow text descriptions and synchronize character movements with the music. Unlike existing approaches, MuseDance eliminates the need for complex motion guidance inputs, such as pose or depth sequences, making flexible and creative video generation accessible to users of all expertise levels. To advance research in this field, we present a new multimodal dataset comprising 2,904 dance videos with corresponding background…
Taxonomy
Topics: Cinema and Media Studies
Methods: Focus
