DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding

Ning Zhang; Zhengyu Li; Kwong Weng Loh; Mingxi Xu; Qi Wang; Zhengyu Wen; Xiaoyu He; Wei Zhao; Kehong Gong; Mingyuan Zhang

arXiv:2602.04188·cs.CV·February 9, 2026

TL;DR

DiMo introduces a unified discrete diffusion framework for bidirectional motion understanding and generation, enabling high-quality, controllable motion synthesis and understanding from text and motion data.

Contribution

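The paper presents DiMo, a discrete diffusion-style model that unifies Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single framework. It uses residual vector quantization (RVQ) to improve motion token fidelity and Group Relative Policy Optimization (GRPO) to improve alignment and controllability.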

Findings

01. Strong motion quality demonstrated on HumanML3D and KIT-ML datasets.

02. Effective bidirectional understanding of text and motion.

03. Supports text-free motion completion and motion caption correction.

Abstract

Prior masked modeling motion generation methods predominantly study text-to-motion. We present DiMo, a discrete diffusion-style framework, which extends masked modeling to bidirectional text-motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode sequentially, DiMo performs iterative masked token refinement, unifying Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model. This decoding paradigm naturally enables a quality-latency trade-off at inference via the number of refinement steps. We further improve motion token fidelity with residual vector quantization (RVQ) and enhance alignment and controllability with Group Relative Policy Optimization (GRPO). Experiments on HumanML3D and KIT-ML show strong motion quality and competitive bidirectional understanding under a unified…
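The iterative masked token refinement described in the abstract, and the way the number of refinement steps trades quality for latency, can be illustrated with a small sketch. This is not DiMo's implementation: the abstract does not specify the unmasking schedule or the network, so the cosine schedule, the confidence-based reveal rule, the `MASK` id, and the `toy_logits` stand-in for a trained denoiser are all assumptions made for illustration. A full discrete diffusion sampler may also re-mask low-confidence tokens, which this sketch omits.

```python
import numpy as np

MASK = -1  # hypothetical id for the mask token (assumption, not from the paper)


def toy_logits(tokens, vocab_size, rng):
    """Stand-in for a trained denoiser: random logits for every position."""
    return rng.normal(size=(len(tokens), vocab_size))


def iterative_unmask(seq_len, vocab_size, num_steps, seed=0):
    """Fill a fully masked sequence in at most `num_steps` model calls."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK, dtype=np.int64)
    calls = 0
    for step in range(1, num_steps + 1):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        logits = toy_logits(tokens, vocab_size, rng)
        calls += 1
        # Softmax over the vocabulary, then greedy prediction and its confidence.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        pred = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        # Assumed cosine schedule: fraction of positions left masked after this step.
        frac_masked = np.cos(0.5 * np.pi * step / num_steps)
        n_keep_masked = int(np.floor(frac_masked * seq_len))
        # Always reveal at least one token so every call makes progress.
        n_reveal = max(1, masked.size - n_keep_masked)
        # Commit the most confident predictions among the still-masked positions.
        order = masked[np.argsort(-conf[masked])]
        reveal = order[:n_reveal]
        tokens[reveal] = pred[reveal]
    return tokens, calls


if __name__ == "__main__":
    for steps in (1, 4, 8):
        out, calls = iterative_unmask(seq_len=16, vocab_size=32, num_steps=steps)
        print(steps, calls, out.tolist())
```

With `num_steps=1` every token is committed from a single model call, which is the fastest setting. Larger values spread the commitments over more calls, so later predictions can condition on tokens already revealed. That is the kind of quality-latency dial the abstract refers to, although the toy denoiser here ignores its input.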

Taxonomy

Topics: Human Motion and Animation · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis