UniMuMo: Unified Text, Music and Motion Generation
Han Yang, Kun Su, Yutong Zhang, Jiaben Chen, Kaizhi Qian, Gaowen Liu, Chuang Gan

TL;DR
UniMuMo is a versatile multimodal model that unifies text, music, and motion generation using a transformer architecture, aligning unpaired data through rhythmic patterns and fine-tuning pre-trained models for efficient cross-modal synthesis.
Contribution
The paper introduces a unified transformer framework that generates text, music, and motion within a single model, using rhythmic alignment of unpaired data and fine-tuned pre-trained models to reduce computational cost.
Findings
Achieves competitive results across all modalities
Successfully unifies music and motion generation tasks
Demonstrates effective cross-modal data alignment
Abstract
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. To address the lack of time-synchronized data, we align unpaired music and motion data based on rhythmic patterns to leverage existing large-scale music-only and motion-only datasets. By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture. To support multiple generation tasks within a single framework, we introduce several architectural improvements. We propose encoding motion with a music codebook, mapping motion into the same feature space as music. We introduce a music-motion parallel generation scheme that unifies all music and motion generation tasks into a single transformer decoder architecture with…
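The rhythm-based pairing of unpaired music and motion mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: it assumes beat timestamps (in seconds) have already been extracted for each clip, and it scores a music clip against candidate motion clips by how closely their beats coincide. The function names `beat_alignment_score` and `best_motion_match` and the tolerance parameter `sigma` are hypothetical, introduced only for illustration.

```python
import math


def beat_alignment_score(music_beats, motion_beats, sigma=0.1):
    """Mean Gaussian closeness of each music beat to its nearest motion beat.

    Both arguments are lists of beat times in seconds (an assumption of this
    sketch). The score lies in [0, 1]; 1.0 means every music beat coincides
    exactly with some motion beat.
    """
    if not music_beats or not motion_beats:
        return 0.0
    total = 0.0
    for t in music_beats:
        # Distance from this music beat to the closest motion beat.
        nearest = min(abs(t - m) for m in motion_beats)
        total += math.exp(-(nearest ** 2) / (2 * sigma ** 2))
    return total / len(music_beats)


def best_motion_match(music_beats, motion_candidates, sigma=0.1):
    """Return (index, score) of the candidate motion clip that best fits the music."""
    if not motion_candidates:
        raise ValueError("motion_candidates must not be empty")
    scores = [beat_alignment_score(music_beats, c, sigma) for c in motion_candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical beat times: one music clip and two unpaired motion clips.
music = [0.0, 0.5, 1.0, 1.5]
candidates = [
    [0.25, 0.75, 1.25],        # off-beat motion
    [0.02, 0.49, 1.01, 1.52],  # nearly on-beat motion
]
index, score = best_motion_match(music, candidates)
```

In this toy setup the second candidate is selected, because its beats fall within a few hundredths of a second of the music beats. A fuller pipeline would also need to extract the beats from raw audio and motion and could time-warp the motion rather than only rank candidates; neither step is shown here.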
Taxonomy
Topics: Human Motion and Animation · Music and Audio Processing · Speech Recognition and Synthesis
Methods: ALIGN
