Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion
Zongye Zhang, Bohan Kong, Qingjie Liu, Yunhong Wang

TL;DR
This paper introduces MoMADiff, a novel framework combining masked modeling and diffusion for robust, controllable 3D human motion generation from text, capable of generalizing to unseen motions with fine-grained control.
Contribution
It proposes a new motion generation method that integrates masked autoregressive diffusion with user-specified keyframes for enhanced control and generalization.
Findings
Outperforms state-of-the-art methods in motion quality and control
Demonstrates strong generalization to novel motions
Supports flexible keyframe-based motion synthesis
Abstract
Generating 3D human motion from text descriptions remains challenging due to the diverse and complex nature of human motion. While existing methods excel within the training distribution, they often struggle with out-of-distribution motions, limiting their applicability in real-world scenarios. Existing VQVAE-based methods often fail to represent novel motions faithfully using discrete tokens, which hampers their ability to generalize beyond seen data. Meanwhile, diffusion-based methods operating on continuous representations often lack fine-grained control over individual frames. To address these challenges, we propose a robust motion generation framework MoMADiff, which combines masked modeling with diffusion processes to generate motion using frame-level continuous representations. Our model supports flexible user-provided keyframe specification, enabling precise control over both…
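As a rough illustration of keyframe-conditioned masked generation as the abstract describes it, the toy sketch below fixes the user-provided keyframes, marks every other frame as masked, and fills masked frames a few at a time. Everything here is an assumption for illustration: the names `build_mask`, `placeholder_predict`, and `generate` are hypothetical, the random unmasking schedule is invented, and the linear interpolation merely stands in for the learned diffusion model, which the abstract does not detail.

```python
import random


def build_mask(num_frames, keyframes):
    """Return a list where True marks a frame that still has to be generated."""
    return [t not in keyframes for t in range(num_frames)]


def placeholder_predict(frames, mask, t):
    """Stand-in for a learned per-frame generator (NOT the paper's model).

    Linearly interpolates between the nearest already-known frames.
    """
    known = [i for i, masked in enumerate(mask) if not masked]
    left = max((i for i in known if i < t), default=None)
    right = min((i for i in known if i > t), default=None)
    if left is None:
        return list(frames[right])
    if right is None:
        return list(frames[left])
    w = (t - left) / (right - left)
    return [(1 - w) * a + w * b for a, b in zip(frames[left], frames[right])]


def generate(num_frames, keyframes, frames_per_step=2, seed=0):
    """Fill all non-keyframe frames, a few per step (hypothetical schedule)."""
    if not keyframes:
        raise ValueError("this toy sketch needs at least one keyframe")
    rng = random.Random(seed)
    dim = len(next(iter(keyframes.values())))
    # Keyframes are copied in as continuous vectors; other frames start empty.
    frames = [list(keyframes.get(t, [0.0] * dim)) for t in range(num_frames)]
    mask = build_mask(num_frames, keyframes)
    while any(mask):
        masked = [t for t, m in enumerate(mask) if m]
        chosen = rng.sample(masked, min(frames_per_step, len(masked)))
        # Predict the chosen frames from the current context, then commit them.
        predicted = {t: placeholder_predict(frames, mask, t) for t in chosen}
        for t, pose in predicted.items():
            frames[t] = pose
            mask[t] = False
    return frames


# Usage: two one-dimensional keyframes at the ends of a five-frame clip.
motion = generate(5, {0: [0.0], 4: [4.0]})
```

In this sketch the keyframes are never overwritten, which is the property that gives the user frame-level control. A real system would replace `placeholder_predict` with a text-conditioned generative model.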
Taxonomy
Topics: Human Motion and Animation · Human Pose and Action Recognition · 3D Shape Modeling and Analysis
Methods: Diffusion
