Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion

Zongye Zhang; Bohan Kong; Qingjie Liu; Yunhong Wang

arXiv:2505.11013 · cs.CV · January 9, 2026

TL;DR

This paper introduces MoMADiff, a framework that combines masked modeling with diffusion for robust, controllable 3D human motion generation from text. It is designed to generalize to unseen motions and to give fine-grained control through user-provided keyframes.

Contribution

It proposes a new motion generation method that integrates masked autoregressive diffusion with user-specified keyframes for enhanced control and generalization.
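To make the idea of keyframe-conditioned masked generation concrete, here is a minimal toy sketch. It is not the authors' implementation: the function names, the `frames_per_step` parameter, and the one-dimensional "pose" vectors are all hypothetical, and the learned diffusion model is replaced by a simple interpolation stand-in. The sketch only illustrates the control flow: user keyframes stay fixed, every other frame starts masked, and masked frames are filled in a few at a time, each conditioned on the frames known so far.

```python
from typing import Callable, Dict, List

Frame = List[float]  # toy stand-in for a continuous per-frame representation


def interpolate_stand_in(index: int, known: Dict[int, Frame]) -> Frame:
    """Hypothetical placeholder for a learned per-frame generator.

    A real system would sample the frame with a diffusion model; here we
    just blend the nearest known frames so the example stays runnable.
    """
    left = max((i for i in known if i < index), default=None)
    right = min((i for i in known if i > index), default=None)
    if left is None:
        return list(known[right])
    if right is None:
        return list(known[left])
    w = (index - left) / (right - left)
    return [(1 - w) * a + w * b for a, b in zip(known[left], known[right])]


def fill_masked_frames(
    num_frames: int,
    keyframes: Dict[int, Frame],
    frames_per_step: int = 2,
    generate: Callable[[int, Dict[int, Frame]], Frame] = interpolate_stand_in,
) -> List[Frame]:
    """Fill every non-keyframe position, a few frames per step."""
    if not keyframes:
        raise ValueError("this toy sketch needs at least one keyframe")
    known = {i: list(f) for i, f in keyframes.items()}  # keyframes are never overwritten
    masked = [i for i in range(num_frames) if i not in known]
    while masked:
        batch, masked = masked[:frames_per_step], masked[frames_per_step:]
        # Frames in one batch are generated from the same context,
        # then added to the context used by later steps.
        new_frames = {i: generate(i, known) for i in batch}
        known.update(new_frames)
    return [known[i] for i in range(num_frames)]


motion = fill_masked_frames(5, {0: [0.0], 4: [4.0]})
print(motion)
```

In this toy setup the two keyframes at positions 0 and 4 are reproduced exactly in the output, which is the property that keyframe control relies on. Swapping `generate` for a text-conditioned sampler is where an actual model would plug in.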

Findings

01 Outperforms state-of-the-art methods in motion quality and control

02 Demonstrates strong generalization to novel motions

03 Supports flexible keyframe-based motion synthesis

Abstract

Generating 3D human motion from text descriptions remains challenging due to the diverse and complex nature of human motion. While existing methods excel within the training distribution, they often struggle with out-of-distribution motions, limiting their applicability in real-world scenarios. Existing VQVAE-based methods often fail to represent novel motions faithfully using discrete tokens, which hampers their ability to generalize beyond seen data. Meanwhile, diffusion-based methods operating on continuous representations often lack fine-grained control over individual frames. To address these challenges, we propose a robust motion generation framework MoMADiff, which combines masked modeling with diffusion processes to generate motion using frame-level continuous representations. Our model supports flexible user-provided keyframe specification, enabling precise control over both…

Code & Models

Models

SteveZh/momadiff_models (Hugging Face model)

Taxonomy

Topics: Human Motion and Animation · Human Pose and Action Recognition · 3D Shape Modeling and Analysis

Methods: Diffusion