Multi-scale Coarse-to-fine Modeling for Test-time Human Motion Control

Nhat Le, Daochang Liu, Anh Nguyen, and Ajmal Mian

arXiv:2605.14935 · cs.CV · May 15, 2026

TL;DR

MSCoT introduces a multi-scale, coarse-to-fine hierarchical model for efficient, controllable human motion synthesis that outperforms existing methods in quality, accuracy, and speed.

Contribution

The paper proposes a novel multi-scale hierarchical modeling and guidance strategy for test-time human motion control, enabling fast, precise, and flexible motion generation.

Findings

01. Improves FID by 48% over baselines.

02. Reduces average control error by 61%.

03. Runs inference 10x faster.

Abstract

We present MSCoT, a multi-scale, coarse-to-fine model for test-time human motion synthesis and control. Unlike recent approaches that rely on multiple iterative denoising/token-prediction steps, or modules tailored for specific control signals, MSCoT discretizes motion into a multi-scale hierarchical representation and predicts the entire token sequence at each temporal scale in a coarse-to-fine fashion. Building on this coarse-to-fine paradigm, we propose an efficient multi-scale token guidance strategy that overcomes the challenge of discrete sampling and steers the token distribution towards the control goals, allowing for fast and flexible control. To address the limitations of a discrete codebook, a lightweight token refiner further adds continuous residuals to the discrete token embeddings and allows differentiable test-time refinement optimization to ensure precise alignment with…
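As a rough illustration of what a multi-scale, coarse-to-fine discrete representation can look like, the sketch below quantizes a motion feature sequence at several temporal scales, with each finer scale encoding the residual left by the coarser ones. This is not the authors' implementation: the function names, the single shared codebook, the linear resampling, and the scale schedule `(2, 4, 8)` are all assumptions made for the example, and the learned token predictor, token guidance, and token refiner described in the abstract are left out.

```python
import numpy as np

def resample(x, length):
    """Linearly resample a (T, D) sequence to `length` frames along time."""
    t_old = np.linspace(0.0, 1.0, x.shape[0])
    t_new = np.linspace(0.0, 1.0, length)
    cols = [np.interp(t_new, t_old, x[:, d]) for d in range(x.shape[1])]
    return np.stack(cols, axis=1)

def nearest_code(z, codebook):
    """Index of the nearest codebook entry (squared L2) for each row of z."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def encode_multiscale(feats, codebook, scales):
    """Quantize feats coarse-to-fine; each scale encodes the remaining residual.

    Hypothetical scheme: one shared codebook and linear up/down-sampling.
    """
    residual = feats.copy()
    tokens = []
    for s in scales:
        coarse = resample(residual, s)        # downsample residual to s frames
        idx = nearest_code(coarse, codebook)  # one token per coarse frame
        tokens.append(idx)
        # Subtract what this scale explains, upsampled back to full length.
        residual = residual - resample(codebook[idx], feats.shape[0])
    return tokens

def decode_multiscale(tokens, codebook, length):
    """Sum the upsampled code embeddings of every scale."""
    out = np.zeros((length, codebook.shape[1]))
    for idx in tokens:
        out += resample(codebook[idx], length)
    return out

# Toy usage with random stand-ins for motion features and a codebook.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 4))      # 8 frames, 4-dim features (made up)
codebook = rng.normal(size=(16, 4))  # 16 codes (made up)
scales = (2, 4, 8)
tokens = encode_multiscale(feats, codebook, scales)
recon = decode_multiscale(tokens, codebook, feats.shape[0])
```

In this toy scheme the whole token sequence at a given scale is produced in one pass, so the number of sequential steps equals the number of scales rather than the number of frames. With a random codebook the reconstruction is not expected to be accurate; the example only shows the data flow.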
