Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control
Nhat Le, Daochang Liu, Anh Nguyen, and Ajmal Mian

TL;DR
MSCoT introduces a multi-scale, coarse-to-fine hierarchical model for efficient, controllable human motion synthesis that outperforms existing methods in quality, accuracy, and speed.
Contribution
The paper proposes a novel multi-scale hierarchical modeling and guidance strategy for test-time human motion control, enabling fast, precise, and flexible motion generation.
Findings
Achieves 48% FID improvement over baselines.
Reduces average control error by 61%.
Runs inference 10x faster.
Abstract
We present MSCoT, a multi-scale, coarse-to-fine model for test-time human motion synthesis and control. Unlike recent approaches that rely on multiple iterative denoising/token-prediction steps, or modules tailored for specific control signals, MSCoT discretizes motion into a multi-scale hierarchical representation and predicts the entire token sequence at each temporal scale in a coarse-to-fine fashion. Building on this coarse-to-fine paradigm, we propose an efficient multi-scale token guidance strategy that overcomes the challenge of discrete sampling and steers the token distribution towards the control goals, allowing for fast and flexible control. To address the limitations of a discrete codebook, a lightweight token refiner further adds continuous residuals to the discrete token embeddings and allows differentiable test-time refinement optimization to ensure precise alignment with…
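The coarse-to-fine discretization described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a residual multi-scale quantization scheme with a single shared codebook, linear resampling along time, and nearest-neighbour lookup, none of which the abstract confirms. The names `resample`, `quantize`, `multiscale_encode`, `multiscale_decode`, and the scale schedule `(2, 4, 8, 16)` are all hypothetical.

```python
import numpy as np

def resample(x, length):
    """Linearly resample a (T, D) sequence to (length, D) along the time axis."""
    t_old = np.linspace(0.0, 1.0, x.shape[0])
    t_new = np.linspace(0.0, 1.0, length)
    return np.stack(
        [np.interp(t_new, t_old, x[:, d]) for d in range(x.shape[1])], axis=1
    )

def quantize(x, codebook):
    """Nearest-neighbour lookup: return token ids and their codebook embeddings."""
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = dists.argmin(axis=1)
    return ids, codebook[ids]

def multiscale_encode(latent, codebook, scales):
    """Assumed scheme: quantize the remaining residual at each temporal scale,
    coarsest first, so later scales only encode what earlier ones missed."""
    residual = latent.copy()
    tokens = []
    for s in scales:
        ids, emb = quantize(resample(residual, s), codebook)
        tokens.append(ids)
        residual = residual - resample(emb, latent.shape[0])
    return tokens

def multiscale_decode(tokens, codebook, length):
    """Sum the upsampled embeddings of every scale to rebuild the latent."""
    out = np.zeros((length, codebook.shape[1]))
    for ids in tokens:
        out += resample(codebook[ids], length)
    return out

# Toy data standing in for a motion latent (16 frames, 4 channels).
rng = np.random.default_rng(0)
latent = rng.normal(size=(16, 4))
codebook = rng.normal(size=(32, 4))   # hypothetical 32-entry codebook
scales = (2, 4, 8, 16)                # hypothetical coarse-to-fine schedule

tokens = multiscale_encode(latent, codebook, scales)
recon = multiscale_decode(tokens, codebook, length=16)
```

In a generative setting, a model would predict the whole token sequence of each scale in one step, conditioned on the coarser scales, rather than reading the tokens off an encoder as this toy example does.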
