Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer
Chenyang Gu, Mingyuan Zhang, Haozhe Xie, Zhongang Cai, Lei Yang, Ziwei Liu

TL;DR
This paper introduces MoTok, a diffusion-based discrete motion tokenizer that combines semantic conditioning with kinematic control for motion generation, improving fidelity and controllability while using fewer tokens.
Contribution
The paper presents MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from motion reconstruction. It sits at the center of a three-stage framework (Perception, Planning, Control) and improves motion fidelity and controllability.
Findings
Reduces trajectory error from 0.72 cm to 0.08 cm.
Decreases FID from 0.083 to 0.029.
Improves fidelity under strong kinematic constraints.
Abstract
Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details…
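As a rough illustration of what "compact single-layer tokens" could look like, the sketch below maps continuous latent vectors to discrete indices with one nearest-neighbour codebook lookup. The abstract does not say how MoTok quantizes, so the lookup rule, the names `quantize` and `dequantize`, and the toy codebook are all assumptions made for illustration. In the described framework, the recovered codes would condition a diffusion decoder, which is not shown here.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Assign each latent vector the index of its nearest codebook entry.

    Hypothetical single-layer quantizer: one token per latent, no residual layers.
    """
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))
        for z in latents
    ]


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Look up the codebook vector for each token index."""
    return [list(codebook[t]) for t in tokens]


# Toy data (illustrative only, not from the paper).
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
latents = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]

tokens = quantize(latents, codebook)
codes = dequantize(tokens, codebook)
print(tokens)  # [0, 1, 2]
```

The lookup is lossy by design: `codes` keeps only the coarse content of `latents`, which is why a tokenizer of this kind needs a separate decoder to restore fine detail.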
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning
