Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation
Dongjie Fu

TL;DR
Mogo is a novel hierarchical transformer architecture that generates high-quality, long-duration 3D human motions efficiently, outperforming existing models in quality, length, and out-of-distribution robustness.
Contribution
It introduces a single transformer model that combines hierarchical residual vector quantization with a causal transformer, generating motion without extra refinement modules.
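As a rough illustration of the residual vector quantization idea named above, the sketch below quantizes a vector in layers, where each layer encodes what the previous layers left over. It shows the generic technique only, not Mogo's RVQ-VAE: the function names, the toy codebooks, and the 2-D input are all hypothetical, and a real model would learn its codebooks over encoded motion latents.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def rvq_encode(vector, codebooks):
    """Quantize layer by layer; each layer encodes the previous layer's residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        index = nearest_code(residual, codebook)
        indices.append(index)
        # Subtract the chosen code so the next layer only sees the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the selected code from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


# Hypothetical toy codebooks (hand-picked, not learned): a coarse layer, then a finer one.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.5, 0.0]],
]
toy_indices, toy_residual = rvq_encode([1.4, 1.0], toy_codebooks)
toy_reconstruction = rvq_decode(toy_indices, toy_codebooks)
```

Each layer yields one discrete index per input, so a stack of layers produces a hierarchy of token sequences, from coarse to fine, that a sequence model can be trained to predict.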
Findings
Surpasses existing models in motion quality and length.
Achieves a state-of-the-art FID score of 0.079 on HumanML3D.
Demonstrates strong out-of-distribution generation performance.
Abstract
In text-to-motion generation, BERT-type masked models (MoMask, MMM) currently produce higher-quality outputs than GPT-type autoregressive models (T2M-GPT). However, these BERT-type models often lack the streaming output capability required for applications in video games and multimedia environments, a feature inherent to GPT-type models. They also demonstrate weaker performance in out-of-distribution generation. To surpass the quality of BERT-type models while leveraging a GPT-type structure, without adding extra refinement models that complicate scaling data, we propose a novel architecture, Mogo (Motion Only Generate Once), which generates high-quality, lifelike 3D human motions by training a single transformer model. Mogo consists of only two main components: 1) RVQ-VAE, a hierarchical residual vector quantization variational autoencoder, which discretizes…
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Human Motion and Animation
