ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation
Inwoo Hwang, Hojun Jang, Bing Zhou, Jian Wang, Young Min Kim, Chuan Guo

TL;DR
ScaleMoGen is a scale-wise autoregressive framework for text-driven human motion generation. It generates motion coarse to fine over multi-scale discrete tokens, with the aim of preserving motion detail and skeletal structure.
Contribution
The paper proposes a multi-scale autoregressive approach that explicitly preserves the skeletal hierarchy at every scale and uses bitwise quantization, achieving state-of-the-art results in text-driven human motion generation.
Findings
Achieved an FID of 0.030 on HumanML3D, outperforming previous methods.
Attained a CLIP Score of 0.693 on SnapMoGen, surpassing prior models.
Enabled training-free, text-guided motion editing with multi-scale representation.
Abstract
We present ScaleMoGen, a scale-wise autoregressive framework for text-driven human motion generation. Unlike conventional autoregressive approaches that rely on standard next-token prediction, ScaleMoGen frames motion generation as a coarse-to-fine process. We quantize 3D motions into compositional discrete tokens across multiple skeletal-temporal scales of increasing granularity, learning to generate motion by autoregressively predicting next-scale token maps. To maintain structural integrity, our motion tokenizers and quantizers are explicitly designed so that discrete tokens at every scale strictly preserve the skeletal hierarchy. Additionally, we employ bitwise quantization and prediction, which efficiently scale up the tokenizer vocabulary to preserve motion details and stabilize optimization. Extensive experiments demonstrate that ScaleMoGen achieves state-of-the-art performance,…
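The coarse-to-fine tokenization with bitwise quantization described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a generic residual multi-scale scheme over the temporal axis only (no skeletal scales, no learned encoder), and the names `bitwise_quantize`, `downsample`, `upsample`, `multiscale_tokenize` and the scale schedule `(1, 2, 4, 8)` are all hypothetical choices made for illustration.

```python
import numpy as np

def bitwise_quantize(z):
    """Quantize each channel to one bit (its sign), scaled so every token has unit norm.
    With d channels, the implicit vocabulary has 2**d entries and needs no codebook."""
    d = z.shape[-1]
    return np.where(z >= 0, 1.0, -1.0) / np.sqrt(d)

def downsample(x, length):
    """Average-pool a (T, d) sequence to (length, d); assumes T is divisible by length."""
    t, d = x.shape
    assert t % length == 0
    return x.reshape(length, t // length, d).mean(axis=1)

def upsample(x, length):
    """Repeat each row so a (t, d) sequence becomes (length, d); assumes t divides length."""
    t, _ = x.shape
    assert length % t == 0
    return np.repeat(x, length // t, axis=0)

def multiscale_tokenize(z, scales=(1, 2, 4, 8)):
    """Assumed residual scheme: at each scale, quantize the pooled residual,
    then subtract its upsampled version before moving to the next finer scale."""
    residual = z.astype(float)  # astype returns a copy, so z is not modified
    recon = np.zeros_like(residual)
    token_maps = []
    for s in scales:
        q = bitwise_quantize(downsample(residual, s))
        token_maps.append(q > 0)  # boolean bit map of shape (s, d)
        up = upsample(q, z.shape[0])
        recon += up
        residual -= up
    return token_maps, recon

# Toy latent sequence: 8 frames, 16 channels (made-up sizes).
latents = np.random.default_rng(0).standard_normal((8, 16))
token_maps, recon = multiscale_tokenize(latents)
print([m.shape for m in token_maps])
```

In a scale-wise autoregressive generator, a model would predict the bit map at each scale conditioned on the text and on all coarser maps; here the maps are only produced by the toy tokenizer, and the fixed bit magnitude means the reconstruction is not guaranteed to improve at every scale.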
