A Unified Framework for Multimodal, Multi-Part Human Motion Synthesis
Zixiang Zhou, Yu Wan, Baoyuan Wang

TL;DR
This paper presents a scalable, unified framework for synthesizing multimodal, multi-part human motion: it quantizes motions into discrete tokens, encodes control signals with pre-trained models, and predicts motion tokens from those signals.
Contribution
It introduces a novel token prediction-based approach that unifies multimodal and multi-part human motion synthesis, enhancing scalability and integration of new modalities.
Findings
Effective in generating realistic multi-part motions
Scalable framework easily incorporates new modalities
Demonstrates broad applicability through extensive experiments
Abstract
The field has made significant progress in synthesizing realistic human motion driven by various modalities. Yet, the need for different methods to animate various body parts according to different control signals limits the scalability of these techniques in practical scenarios. In this paper, we introduce a cohesive and scalable approach that consolidates multimodal (text, music, speech) and multi-part (hand, torso) human motion generation. Our methodology unfolds in several steps: We begin by quantizing the motions of diverse body parts into separate codebooks tailored to their respective domains. Next, we harness the robust capabilities of pre-trained models to transcode multimodal signals into a shared latent space. We then translate these signals into discrete motion tokens by iteratively predicting subsequent tokens to form a complete sequence. Finally, we reconstruct the…
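The quantize, predict, and reconstruct steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-neighbor codebook lookup, the greedy decoding loop, and every name here (`quantize`, `dequantize`, `generate_tokens`, `toy_next_token_logits`) are assumptions made for illustration, and the learned token predictor and pre-trained signal encoders are replaced by a hand-written toy function.

```python
import numpy as np


def quantize(features, codebook):
    """Map each frame feature (T, D) to the index of its nearest codebook row (K, D)."""
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def dequantize(tokens, codebook):
    """Look up the codebook vector for each token to rebuild a motion sequence."""
    return codebook[np.asarray(tokens, dtype=int)]


def generate_tokens(next_token_logits, condition, max_len, end_token):
    """Greedy decoding: repeatedly pick the most likely next token until the end token."""
    tokens = []
    for _ in range(max_len):
        logits = next_token_logits(condition, tokens)
        tok = int(np.argmax(logits))
        if tok == end_token:
            break
        tokens.append(tok)
    return tokens


def toy_next_token_logits(condition, tokens, vocab_size=9, end_token=8):
    """Hypothetical stand-in for a learned predictor: counts up from `condition`,
    then emits the end token once four tokens have been produced."""
    logits = np.zeros(vocab_size)
    if len(tokens) >= 4:
        logits[end_token] = 1.0
    else:
        logits[(condition + len(tokens)) % end_token] = 1.0
    return logits


# One codebook per body part, as the abstract describes (sizes are made up).
codebooks = {"hand": np.eye(8, 4), "torso": np.eye(8, 6)}

# Quantize a short, slightly perturbed hand motion into discrete tokens.
hand_motion = codebooks["hand"][[2, 0, 3]] + 0.1
hand_tokens = quantize(hand_motion, codebooks["hand"])

# Predict a token sequence from a (toy) condition, then reconstruct the motion.
predicted = generate_tokens(toy_next_token_logits, condition=2, max_len=16, end_token=8)
reconstructed = dequantize(predicted, codebooks["torso"])
```

In this sketch each body part keeps its own codebook, so supporting another part would only mean adding another entry to `codebooks`; how the paper itself handles this is not specified in the truncated abstract.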
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · Hand Gesture Recognition Systems
