TeMuDance: Contrastive Alignment-Based Textual Control for Music-Driven Dance Generation
Xinran Liu, Diptesh Kanojia, Wenwu Wang, Zhenhua Feng

TL;DR
TeMuDance is a framework that adds natural-language control to music-driven dance generation by aligning disjoint music-dance and text-motion datasets in a shared embedding space, without requiring manually annotated music-text-motion triplets.
Contribution
TeMuDance introduces a motion-centred bridging paradigm, which uses motion as the shared anchor between modalities, and a lightweight text control branch that improves semantic controllability in dance generation without extensive labelled data.
Findings
TeMuDance achieves high-quality dance generation with improved text control.
The framework effectively aligns music, text, and motion in a shared embedding space.
Experimental results show competitive dance quality and enhanced semantic controllability.
Abstract
Existing music-driven dance generation approaches have achieved strong realism and effective audio-motion alignment. However, they generally lack semantic controllability, making it difficult to guide specific movements through natural language descriptions. This limitation primarily stems from the absence of large-scale datasets that jointly align music, text, and motion for supervised learning of text-conditioned control. To address this challenge, we propose TeMuDance, a framework that enables text-based control for music-conditioned dance generation without requiring any manually annotated music-text-motion triplet dataset. TeMuDance introduces a motion-centred bridging paradigm that leverages motion as a shared semantic anchor to align disjoint music-dance and text-motion datasets within a unified embedding space, enabling cross-modal retrieval of missing modalities for end-to-end…
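The motion-centred bridging idea in the abstract can be illustrated with a small sketch. It assumes that motion, text, and music have already been encoded into one shared space, that alignment is trained with an InfoNCE-style contrastive loss (suggested by the title, but not confirmed by the abstract), and that a missing caption for a dance clip is retrieved by cosine similarity. The function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss where anchors[i] is paired with positives[i].

    Assumption: this is a generic contrastive loss, used here to show how
    motion embeddings could be pulled towards their paired text (or music)
    embeddings. The paper's actual objective may differ.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

def retrieve_caption(dance_motion_embedding, text_bank):
    """Return the caption whose embedding is closest to the dance motion.

    text_bank is a list of (caption, text_embedding) pairs taken from a
    text-motion dataset (hypothetical layout).
    """
    best = max(text_bank, key=lambda item: cosine(dance_motion_embedding, item[1]))
    return best[0]

# Toy data: a music-dance clip has no caption, so one is retrieved
# through the shared space using the clip's motion embedding.
text_bank = [
    ("spin on the spot", [1.0, 0.0, 0.1]),
    ("jump forward", [0.0, 1.0, 0.1]),
]
dance_motion_embedding = [0.9, 0.1, 0.0]
pseudo_caption = retrieve_caption(dance_motion_embedding, text_bank)

# Correctly paired embeddings should give a lower loss than shuffled pairs.
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch the retrieved caption fills the missing text slot of a music-dance pair, which is one plausible reading of "cross-modal retrieval of missing modalities"; retrieving music for a text-motion pair would work the same way with a music bank.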
