MoSa: Motion Generation with Scalable Autoregressive Modeling
Mengyuan Liu, Sheng Yan, Yong Wang, Yingjie Li, Gui-Bin Bian, Hong Liu

TL;DR
MoSa introduces a hierarchical, scalable autoregressive framework for text-driven 3D human motion generation, achieving state-of-the-art quality and efficiency with fewer inference steps and strong generalization capabilities.
Contribution
The paper proposes MoSa, a novel hierarchical motion generation method that employs a multi-scale token preservation strategy and scalable autoregressive modeling, significantly improving speed and quality over prior approaches.
Findings
MoSa achieves an FID of 0.06 on Motion-X, outperforming previous methods.
Inference time is reduced by 27% compared to prior models.
MoSa effectively generalizes to motion editing tasks without additional fine-tuning.
Abstract
We introduce MoSa, a novel hierarchical motion generation framework for text-driven 3D human motion generation that enhances the Vector Quantization-guided Generative Transformers (VQ-GT) paradigm through a coarse-to-fine scalable generation process. In MoSa, we propose a Multi-scale Token Preservation Strategy (MTPS) integrated into a hierarchical residual vector quantization variational autoencoder (RQ-VAE). MTPS employs interpolation at each hierarchical quantization to effectively retain coarse-to-fine multi-scale tokens. With this, the generative transformer supports Scalable Autoregressive (SAR) modeling, which predicts scale tokens, unlike traditional methods that predict only one token at each step. Consequently, MoSa requires only 10 inference steps, matching the number of RQ-VAE quantization layers. To address potential reconstruction degradation from frequent interpolation,…
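The coarse-to-fine tokenization described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the scale schedule, the codebook sizes, and the use of linear interpolation are all assumptions made for the example. Each quantization layer resizes the current residual to that layer's temporal length, looks up the nearest codebook entries, and interpolates the result back to full length before subtracting it, so every layer yields one "scale" of tokens.

```python
import numpy as np


def resize_linear(x, new_len):
    """Linearly interpolate a (T, D) sequence to (new_len, D) along time."""
    old_len = x.shape[0]
    if old_len == new_len:
        return x.copy()
    if old_len == 1:
        # A single frame is simply repeated.
        return np.repeat(x, new_len, axis=0)
    old_pos = np.linspace(0.0, 1.0, old_len)
    new_pos = np.linspace(0.0, 1.0, new_len)
    cols = [np.interp(new_pos, old_pos, x[:, d]) for d in range(x.shape[1])]
    return np.stack(cols, axis=1)


def nearest_code(z, codebook):
    """Return the index of the closest codebook row for each row of z."""
    dist = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1)


def multiscale_residual_quantize(z, codebooks, scales):
    """Hypothetical multi-scale residual quantizer.

    z:         (T, D) latent sequence from an encoder
    codebooks: one (K, D) array per quantization layer
    scales:    temporal length of the token map kept at each layer
    """
    full_len = z.shape[0]
    residual = z.copy()
    recon = np.zeros_like(z)
    tokens = []
    for codebook, scale in zip(codebooks, scales):
        coarse = resize_linear(residual, scale)      # shrink to this scale
        idx = nearest_code(coarse, codebook)         # tokens at this scale
        up = resize_linear(codebook[idx], full_len)  # back to full length
        recon += up
        residual -= up
        tokens.append(idx)
    return tokens, recon


# Usage with made-up sizes: 10 layers, matching the 10 steps in the abstract.
rng = np.random.default_rng(0)
scales = [1, 2, 3, 4, 6, 8, 10, 12, 14, 16]  # assumed schedule
z = rng.normal(size=(16, 4))
codebooks = [rng.normal(size=(32, 4)) for _ in scales]
tokens, recon = multiscale_residual_quantize(z, codebooks, scales)
```

Under this reading, a scalable autoregressive generator would emit all tokens of one scale per step, conditioned on the coarser scales, which is why the number of inference steps equals the number of quantization layers.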
Taxonomy
Topics: Human Motion and Animation · 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis
