UniMuMo: Unified Text, Music and Motion Generation

Han Yang; Kun Su; Yutong Zhang; Jiaben Chen; Kaizhi Qian; Gaowen Liu; Chuang Gan

arXiv:2410.04534 · cs.SD · October 8, 2024

TL;DR

UniMuMo is a versatile multimodal model that unifies text, music, and motion generation using a transformer architecture, aligning unpaired data through rhythmic patterns and fine-tuning pre-trained models for efficient cross-modal synthesis.

Contribution

It introduces a unified transformer framework for simultaneous text, music, and motion generation, leveraging rhythmic alignment and pre-trained models to reduce computational costs.

Findings

01 Achieves competitive results across all modalities

02 Successfully unifies music and motion generation tasks

03 Demonstrates effective cross-modal data alignment

Abstract

We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. To address the lack of time-synchronized data, we align unpaired music and motion data based on rhythmic patterns to leverage existing large-scale music-only and motion-only datasets. By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture. To support multiple generation tasks within a single framework, we introduce several architectural improvements. We propose encoding motion with a music codebook, mapping motion into the same feature space as music. We introduce a music-motion parallel generation scheme that unifies all music and motion generation tasks into a single transformer decoder architecture with…
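The rhythm-based pairing of unpaired music and motion mentioned in the abstract can be illustrated with a small sketch. This is not the authors' procedure: it assumes beat timestamps (in seconds) have already been extracted for a music track and for several candidate motion clips, and it simply picks the clip whose beats lie closest to the music beats. The function names, the clip names, and the scoring rule are all hypothetical choices made for illustration.

```python
from bisect import bisect_left

def nearest_gap(t, sorted_times):
    """Distance from time t to the closest value in a non-empty sorted list."""
    i = bisect_left(sorted_times, t)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_times[max(i - 1, 0): i + 1]
    return min(abs(t - c) for c in candidates)

def beat_mismatch(music_beats, motion_beats):
    """Mean gap in seconds from each music beat to its nearest motion beat."""
    motion_sorted = sorted(motion_beats)
    gaps = [nearest_gap(t, motion_sorted) for t in music_beats]
    return sum(gaps) / len(gaps)

def pick_best_motion(music_beats, motion_clips):
    """Name of the motion clip whose beats best match the music (assumed rule)."""
    return min(motion_clips, key=lambda name: beat_mismatch(music_beats, motion_clips[name]))

# Hypothetical beat timestamps, invented for this example.
music_beats = [0.0, 0.5, 1.0, 1.5]
motion_clips = {
    "walk": [0.1, 0.9, 1.7],
    "dance": [0.02, 0.52, 0.98, 1.51],
}
best = pick_best_motion(music_beats, motion_clips)
print(best)
```

A nearest-beat score like this only ranks candidate pairs; it does not stretch or shift the motion in time, which a fuller alignment step would likely also need.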

Code & Models

Repositories

hanyangclarence/UniMuMo (PyTorch, official)

Videos

UniMuMo: Unified Text, Music, and Motion Generation

Taxonomy

Topics: Human Motion and Animation · Music and Audio Processing · Speech Recognition and Synthesis

Methods: ALIGN