Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

Chenyang Gu; Mingyuan Zhang; Haozhe Xie; Zhongang Cai; Lei Yang; Ziwei Liu

arXiv:2603.19227·cs.CV·March 20, 2026

Open Access

TL;DR

This paper introduces MoTok, a diffusion-based discrete motion tokenizer that combines semantic conditioning with kinematic control for motion generation, improving fidelity and controllability while using fewer tokens.

Contribution

The paper presents MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from motion reconstruction. It sits at the center of a three-stage framework (Perception, Planning, Control) designed to improve motion fidelity and controllability.

Findings

01. Reduces trajectory error from 0.72 cm to 0.08 cm.
02. Decreases FID from 0.083 to 0.029.
03. Improves fidelity under strong kinematic constraints.

Abstract

Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details…
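The split between coarse planning over discrete tokens and fine-grained control in a decoder can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `quantize` is a plain nearest-neighbour lookup in a single codebook, and `toy_control` is a simple iterative refinement loop standing in for the diffusion decoder. It is not a real diffusion sampler. The function names, shapes, step size, and example values are all hypothetical.

```python
import numpy as np

def quantize(latents, codebook):
    """Single-layer quantization: index of the nearest codebook entry per frame."""
    # latents: (T, D), codebook: (K, D) -> squared distances of shape (T, K)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def toy_control(tokens, codebook, target, mask, steps=50, step_size=0.5, seed=0):
    """Toy stand-in for a diffusion decoder (assumption, NOT a real sampler).

    Starts from noise, repeatedly pulls the sample toward the token-conditioned
    coarse motion, and takes a corrective step toward `target` on the frames
    where `mask` is 1 (the fine-grained kinematic constraint).
    """
    rng = np.random.default_rng(seed)
    coarse = codebook[tokens]                  # (T, D) token-conditioned estimate
    x = rng.normal(size=coarse.shape)          # start from Gaussian noise
    for _ in range(steps):
        x = x + step_size * (coarse - x)       # follow the discrete plan
        x = x - step_size * mask[:, None] * (x - target)  # constraint step
    return x

# Hypothetical 2-D "motion" with 4 frames and a 3-entry codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2], [1.8, 0.1], [1.1, 0.9]])
tokens = quantize(latents, codebook)

# Constrain only the last frame to a target position.
target = np.zeros_like(latents)
target[3] = [1.5, 0.5]
mask = np.array([0.0, 0.0, 0.0, 1.0])
motion = toy_control(tokens, codebook, target, mask)
```

In this sketch the unconstrained frames settle on their codebook entries, while the constrained frame ends up between its token's entry and the target, closer to the target than the token alone would place it. The tokens carry only the coarse plan, and the fine adjustment happens in the decoding loop.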

Taxonomy

Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning