MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls
Yuxuan Bian, Ailing Zeng, Xuan Ju, Xian Liu, Zhaoyang Zhang, Wei Liu, Qiang Xu

TL;DR
MotionCraft introduces a unified diffusion transformer for multimodal whole-body motion generation, effectively handling diverse control modalities and motion formats through a coarse-to-fine training strategy and novel graph modeling.
Contribution
The paper presents MotionCraft, a novel framework with plug-and-play multimodal control, a two-stage training process, and a new benchmark, MC-Bench, for improved multimodal motion generation.
Findings
Achieves state-of-the-art results on multiple motion generation tasks
Effectively models static and dynamic human topology graphs
Addresses motion format inconsistency with the new MC-Bench benchmark
Abstract
Whole-body multimodal motion generation, controlled by text, speech, or music, has numerous applications including video generation and character animation. However, employing a unified model to achieve various generation tasks with different condition modalities presents two main challenges: motion distribution drifts across different tasks (e.g., co-speech gestures and text-driven daily actions) and the complex optimization of mixed conditions with varying granularities (e.g., text and audio). Additionally, inconsistent motion formats across different tasks and datasets hinder effective training toward multimodal motion generation. In this paper, we propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control. Our framework employs a coarse-to-fine training strategy, starting with the first stage of text-to-motion semantic…
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Hand Gesture Recognition Systems
Methods: Diffusion
