Next-Scale Autoregressive Models for Text-to-Motion Generation

Zhiwei Zheng; Shibo Jin; Lingjie Liu; Mingmin Zhao

arXiv:2604.03799 · cs.CV · April 7, 2026

TL;DR

MoScale introduces a hierarchical autoregressive framework for text-to-motion generation, improving long-range motion structure, robustness, and zero-shot generalization.

Contribution

The paper proposes a next-scale AR approach that generates motion hierarchically from coarse to fine temporal resolutions, aiming for more stable and scalable text-conditioned motion synthesis.

Findings

01. Achieves state-of-the-art performance in text-to-motion tasks.

02. Demonstrates high training efficiency and effective scaling.

03. Generalizes well to diverse motion generation and editing tasks in zero-shot settings.

Abstract

Autoregressive (AR) models offer stable and efficient training, but standard next-token prediction is not well aligned with the temporal structure required for text-conditioned motion generation. We introduce MoScale, a next-scale AR framework that generates motion hierarchically from coarse to fine temporal resolutions. By providing global semantics at the coarsest scale and refining them progressively, MoScale establishes a causal hierarchy better suited for long-range motion structure. To improve robustness under limited text-motion data, we further incorporate cross-scale hierarchical refinement for improving per-scale initial predictions and in-scale temporal refinement for selective bidirectional re-prediction. MoScale achieves SOTA text-to-motion performance with high training efficiency, scales effectively with model size, and generalizes zero-shot to diverse motion generation…
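
To make the coarse-to-fine idea concrete, here is a minimal sketch of a multi-scale temporal decomposition. It is not MoScale's implementation: the residual scheme, the average-pool and nearest-neighbour resampling, the scale schedule, and all function names (`downsample`, `upsample`, `decompose`, `refine`) are assumptions made for illustration. In an actual next-scale AR model, a learned network would predict each scale's content from the text prompt and all coarser scales; here the per-scale content is simply computed from a known 1-D motion signal to show how coarse structure is fixed first and detail is added afterwards.

```python
# Illustrative sketch only: a hand-rolled coarse-to-fine residual pyramid
# over one motion channel (e.g. a single joint coordinate over time).
# All names and design choices here are hypothetical, not from the paper.

def downsample(seq, length):
    """Average-pool a list of floats into `length` bins (length <= len(seq))."""
    n = len(seq)
    out = []
    for i in range(length):
        chunk = seq[i * n // length:(i + 1) * n // length]
        out.append(sum(chunk) / len(chunk))
    return out


def upsample(seq, length):
    """Nearest-neighbour upsample a list of floats to `length` frames."""
    m = len(seq)
    return [seq[i * m // length] for i in range(length)]


def decompose(motion, scales):
    """Split `motion` into per-scale residuals, ordered coarse to fine.

    Each scale stores what the coarser scales have not yet explained.
    """
    n = len(motion)
    recon = [0.0] * n
    residuals = []
    for s in scales:
        remaining = [x - r for x, r in zip(motion, recon)]
        r_s = downsample(remaining, s)
        residuals.append(r_s)
        recon = [r + u for r, u in zip(recon, upsample(r_s, n))]
    return residuals


def refine(residuals, n):
    """Rebuild the motion scale by scale; return the estimate after each scale.

    In a generative model, each `r_s` would be predicted conditioned on the
    text and on all coarser scales rather than read from `residuals`.
    """
    recon = [0.0] * n
    stages = []
    for r_s in residuals:
        recon = [r + u for r, u in zip(recon, upsample(r_s, n))]
        stages.append(recon)
    return stages


if __name__ == "__main__":
    motion = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
    scales = [1, 2, 4, 8]  # assumed schedule; the last scale is full resolution
    for scale, stage in zip(scales, refine(decompose(motion, scales), len(motion))):
        print(scale, stage)
```

The coarsest scale carries a single value that summarizes the whole sequence, and each later scale only corrects what is still missing. Because the finest scale in this sketch matches the frame count, the final stage reproduces the input up to floating-point error.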
