RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text
Jiaben Chen, Xin Yan, Yihang Chen, Siyuan Cen, Zixin Wang, Qinwei Ma, Haoyu Zhen, Kaizhi Qian, Lie Lu, Chuang Gan

TL;DR
This paper introduces RapVerse, a novel framework that generates synchronized 3D body motions and singing vocals from text, utilizing a large new dataset and multimodal transformers to produce realistic, coherent performances.
Contribution
The work presents the first unified model for simultaneous generation of vocals and full-body motions from text, supported by a new dataset and multimodal transformer architecture.
Findings
Generated outputs are coherent and realistic, matching textual inputs.
The framework rivals specialized single-modality systems in quality.
Established new benchmarks for joint vocal-motion generation.
Abstract
In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens preserving content, prosodic…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMusic and Audio Processing · Music Technology and Sound Studies · Human Motion and Animation
