Kimodo: Scaling Controllable Human Motion Generation

Davis Rempe; Mathis Petrovich; Ye Yuan; Haotian Zhang; Xue Bin Peng; Yifeng Jiang; Tingwu Wang; Umar Iqbal; David Minor; Michael de Ruyter; Jiefeng Li; Chen Tessler; Edy Lim; Eugene Jeong; Sam Wu; Ehsan Hassani; Michael Huang; Jin-Bey Yu; Chaeyeon Chung; Lina Song; Olivier Dionne; Jan Kautz; Simon Yuen; Sanja Fidler

arXiv:2603.15546 · cs.CV · March 17, 2026

TL;DR

Kimodo is a new kinematic motion diffusion model trained on 700 hours of mocap data, enabling high-quality, controllable human motion synthesis via text and various kinematic constraints.

Contribution

The paper introduces Kimodo, a scalable, controllable kinematic motion diffusion model trained on 700 hours of optical motion capture data, with a novel two-stage architecture for improved motion quality and flexibility.

Findings

01 Model achieves high-quality motion synthesis.

02 Scaling dataset and model size improves performance.

03 Flexible control through multiple kinematic constraints.

Abstract

High-quality human motion data is becoming increasingly important for applications in robotics, simulation, and entertainment. Recent generative models offer a potential data source, enabling human motion synthesis through intuitive inputs like text prompts or kinematic constraints on poses. However, the small scale of public mocap datasets has limited the motion quality, control accuracy, and generalization of these models. In this work, we introduce Kimodo, an expressive and controllable kinematic motion diffusion model trained on 700 hours of optical motion capture data. Our model generates high-quality motions while being easily controlled through text and a comprehensive suite of kinematic constraints including full-body keyframes, sparse joint positions/rotations, 2D waypoints, and dense 2D paths. This is enabled through a carefully designed motion representation and two-stage…

Taxonomy

Topics: Human Motion and Animation · Human Pose and Action Recognition · 3D Shape Modeling and Analysis