MaskControl: Spatio-Temporal Control for Masked Motion Synthesis

Ekkasit Pinyoanuntapong; Muhammad Usama Saleem; Korrawe Karunratanakul; Pu Wang; Hongfei Xue; Chen Chen; Chuan Guo; Junli Cao; Jian Ren; Sergey Tulyakov

arXiv:2410.10780·cs.CV·October 21, 2025

Open Access

TL;DR

MaskControl introduces a novel method for high-precision, controllable masked motion synthesis by combining logits regularization, explicit logit optimization, and differentiable expectation sampling, outperforming previous methods in quality and control accuracy.

Contribution

MaskControl is the first approach to add controllability to generative masked motion models. It does so with a training-time Logits Regularizer, an inference-time Logit Optimization step, and a new sampling method.

Findings

01 Reduces motion generation error (FID improves by ~77%)

02 Achieves higher control precision, with an average error of 0.91

03 Enables diverse control applications such as joint, frame, and zero-shot control

Abstract

Recent advances in motion diffusion models have enabled spatially controllable text-to-motion generation. However, these models struggle to achieve high-precision control while maintaining high-quality motion generation. To address these challenges, we propose MaskControl, the first approach to introduce controllability to the generative masked motion model. Our approach introduces two key innovations. First, Logits Regularizer implicitly perturbs logits at training time to align the distribution of motion tokens with the controlled joint positions, while regularizing the categorical token prediction to ensure high-fidelity generation. Second, Logit Optimization explicitly optimizes the predicted logits during inference time, directly reshaping the token distribution that forces the generated motion to accurately align with the controlled joint positions. Moreover, we…
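The idea of optimizing logits at inference time can be illustrated with a toy sketch. This is not the authors' implementation, and the abstract above is truncated before it gives details. The sketch assumes that "differentiable expectation sampling" can be read as replacing a hard token pick with a probability-weighted average over codebook entries, so that a control error can be differentiated with respect to the logits. The scalar `codebook`, the squared-error loss, and the function names are all hypothetical stand-ins for a real motion decoder and control objective.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_position(logits, codebook):
    """Probability-weighted average of codebook entries (assumed reading
    of 'expectation sampling'); differentiable in the logits."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, codebook))


def optimize_logits(logits, codebook, target, lr=0.5, steps=300):
    """Gradient descent on the logits to minimize (E[position] - target)^2.

    For pos = sum_k softmax(z)_k * c_k, the gradient of the squared error
    with respect to z_j is 2 * (pos - target) * p_j * (c_j - pos).
    """
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        pos = sum(p * c for p, c in zip(probs, codebook))
        err = pos - target
        grad = [2.0 * err * p * (c - pos) for p, c in zip(probs, codebook)]
        logits = [z - lr * g for z, g in zip(logits, grad)]
    return logits


# Hypothetical toy setup: each token decodes to a single 1-D joint position.
codebook = [0.0, 0.5, 1.0, 1.5]
initial_logits = [2.0, 0.0, 0.0, 0.0]  # model initially prefers token 0
target = 1.2                            # controlled joint position

optimized = optimize_logits(initial_logits, codebook, target)
before = expected_position(initial_logits, codebook)
after = expected_position(optimized, codebook)
print(f"expected position before: {before:.3f}, after: {after:.3f}")
```

In this toy setting the optimization reshapes the token distribution until its expected decoded position matches the control target. A real system would decode token embeddings through a learned motion decoder and would typically rely on automatic differentiation rather than a hand-derived gradient.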

Taxonomy

Topics: Human Motion and Animation · Augmented Reality Applications · Interactive and Immersive Displays

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion