TMD-Bench: A Multi-Level Evaluation Paradigm for Music-Dance Co-Generation
Xiaoda Yang, Majun Zhang, Changhao Pan, Nick Huang, Yang Yuguang, Fan Zhuo, Pengfei Zhou, Jin Zhou, Sizhe Shan, Shan Yang, Miles Yang, Yang You, Zhou Zhao

TL;DR
TMD-Bench is a benchmark for evaluating text-driven music-dance co-generation along three axes: unimodal generation quality, instruction adherence, and cross-modal rhythmic alignment. It is supported by a curated rhythm-aligned music-dance dataset and a set of evaluation metrics.
Contribution
The paper introduces TMD-Bench, a novel benchmark with metrics, a dataset, and a baseline model for assessing music-dance co-generation systems.
Findings
Commercial models like Veo 3 and Sora 2 excel in quality but lack rhythmic coupling.
The RhyJAM baseline achieves competitive beat synchronization.
Rhythmic and kinetic coherence remains an area for improvement.
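To make "beat synchronization" concrete, here is a minimal sketch of a beat alignment score of the kind commonly used in music-dance evaluation. It is an illustration under stated assumptions, not TMD-Bench's own metric, whose definition is not given in this summary. The function name `beat_alignment_score`, the `sigma` tolerance, and the assumption that music and motion beat times have already been extracted (in seconds) are all hypothetical.

```python
import math


def beat_alignment_score(music_beats, motion_beats, sigma=0.1):
    """Average closeness of each motion beat to its nearest music beat.

    Hypothetical helper for illustration only; not the TMD-Bench metric.
    music_beats, motion_beats: lists of beat times in seconds.
    sigma: assumed tolerance (seconds) of the Gaussian weighting.
    Returns a value in [0, 1]; 0.0 if either list is empty.
    """
    if not music_beats or not motion_beats:
        return 0.0
    total = 0.0
    for t in motion_beats:
        # Distance from this motion beat to the closest music beat.
        nearest = min(abs(t - m) for m in music_beats)
        # Gaussian weighting: 1.0 for a perfect hit, decaying with distance.
        total += math.exp(-(nearest ** 2) / (2 * sigma ** 2))
    return total / len(motion_beats)


if __name__ == "__main__":
    music = [0.5, 1.0, 1.5]
    print(beat_alignment_score(music, [0.5, 1.0, 1.5]))     # perfectly aligned
    print(beat_alignment_score(music, [0.75, 1.25, 1.75]))  # off-beat motion
```

Under this sketch, a system with high visual and audio quality can still score low if its dance accents fall between musical beats, which is the kind of gap the findings above describe.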
Abstract
Unified audio-visual generation is rapidly gaining industrial and creative relevance, enabling applications in virtual production and interactive media. However, when moving from general audio-video synthesis to music-dance co-generation, the task becomes substantially harder: musical rhythm, phrasing, and accents must drive choreographic motion at fine temporal resolution, and such rhythmic coupling is not captured by unimodal metrics or generic audiovisual consistency scores used in current evaluation practice. We introduce TMD-Bench, a benchmark for text-driven music-dance co-generation that assesses systems across unimodal generation quality, instruction adherence, and cross-modal rhythmic alignment. The benchmark integrates computable physical metrics with perceptual multimodal judgments, and is supported by a curated rhythm-aligned music-dance dataset and a fine-grained Music…
