MoSa: Motion Generation with Scalable Autoregressive Modeling

Mengyuan Liu; Sheng Yan; Yong Wang; Yingjie Li; Gui-Bin Bian; Hong Liu

arXiv:2511.01200 · cs.CV · November 4, 2025

Open Access

TL;DR

MoSa introduces a hierarchical, scalable autoregressive framework for text-driven 3D human motion generation, achieving state-of-the-art quality and efficiency with fewer inference steps and strong generalization capabilities.

Contribution

The paper proposes MoSa, a novel hierarchical motion generation method that employs a multi-scale token preservation strategy and scalable autoregressive modeling, significantly improving speed and quality over prior approaches.

Findings

01. MoSa achieves an FID of 0.06 on Motion-X, outperforming previous methods.

02. Inference time is reduced by 27% compared to prior models.

03. MoSa effectively generalizes to motion editing tasks without additional fine-tuning.

Abstract

We introduce MoSa, a novel hierarchical motion generation framework for text-driven 3D human motion generation that enhances the Vector Quantization-guided Generative Transformers (VQ-GT) paradigm through a coarse-to-fine scalable generation process. In MoSa, we propose a Multi-scale Token Preservation Strategy (MTPS) integrated into a hierarchical residual vector quantization variational autoencoder (RQ-VAE). MTPS employs interpolation at each hierarchical quantization to effectively retain coarse-to-fine multi-scale tokens. With this, the generative transformer supports Scalable Autoregressive (SAR) modeling, which predicts scale tokens, unlike traditional methods that predict only one token at each step. Consequently, MoSa requires only 10 inference steps, matching the number of RQ-VAE quantization layers. To address potential reconstruction degradation from frequent interpolation,…
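
As a rough illustration of the coarse-to-fine idea described in the abstract, the sketch below runs residual quantization over a 1-D latent sequence and uses interpolation at every layer so that each layer keeps a token list at its own temporal scale. It is not the authors' implementation. The scalar codebook shared across layers, the linear interpolation, the scale schedule `[1, 2, 4, 8]`, and all function names are assumptions made for this example only.

```python
def interpolate(seq, new_len):
    """Linearly resample a 1-D sequence to new_len samples (endpoints aligned)."""
    old_len = len(seq)
    if new_len == 1:
        return [sum(seq) / old_len]          # collapse to the mean
    if old_len == 1:
        return [seq[0]] * new_len            # broadcast a single value
    out = []
    for i in range(new_len):
        pos = i * (old_len - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        out.append(seq[lo] * (1 - frac) + seq[hi] * frac)
    return out


def nearest_code(value, codebook):
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda j: abs(codebook[j] - value))


def multiscale_residual_quantize(latent, scales, codebook):
    """Hypothetical multi-scale residual quantizer (illustration only).

    At each layer the current residual is interpolated down to that layer's
    length, quantized, interpolated back up, and subtracted from the residual.
    The per-layer token lists are kept, one list per scale.
    """
    residual = list(latent)
    recon = [0.0] * len(latent)
    tokens_per_scale = []
    for s in scales:
        coarse = interpolate(residual, s)                 # downsample residual
        ids = [nearest_code(v, codebook) for v in coarse]
        tokens_per_scale.append(ids)                      # keep this scale's tokens
        quantized = interpolate([codebook[j] for j in ids], len(latent))
        recon = [r + q for r, q in zip(recon, quantized)]
        residual = [r - q for r, q in zip(residual, quantized)]
    return tokens_per_scale, recon


latent = [0.9, 0.7, 0.2, -0.1, -0.4, -0.6, -0.2, 0.3]
scales = [1, 2, 4, 8]                                     # assumed schedule
codebook = [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0]       # assumed scalar codebook
tokens, recon = multiscale_residual_quantize(latent, scales, codebook)
print([len(t) for t in tokens])  # → [1, 2, 4, 8]
```

In this toy setting a generator that emits one whole scale per step would need `len(scales)` steps, which mirrors the abstract's statement that the number of inference steps matches the number of quantization layers.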

Taxonomy

Topics: Human Motion and Animation · 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis