Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation

Dongjie Fu

arXiv:2412.07797·cs.CV·December 12, 2024

Open Access

TL;DR

Mogo is a novel hierarchical transformer architecture that generates high-quality, long-duration 3D human motions efficiently, outperforming existing models in quality, length, and out-of-distribution robustness.

Contribution

It introduces a single-transformer architecture that combines hierarchical residual vector quantization with a causal transformer, generating motion without extra refinement modules.

Findings

01

Surpasses existing models in motion quality and length.

02

Achieves a state-of-the-art FID score of 0.079 on HumanML3D.

03

Demonstrates strong out-of-distribution generation performance.

Abstract

In the field of text-to-motion generation, BERT-type masked models (MoMask, MMM) currently produce higher-quality outputs than GPT-type autoregressive models (T2M-GPT). However, these BERT-type models often lack the streaming output capability required for applications in video game and multimedia environments, a feature inherent to GPT-type models. Additionally, they demonstrate weaker performance in out-of-distribution generation. To surpass the quality of BERT-type models while leveraging a GPT-type structure, without adding extra refinement models that complicate scaling data, we propose a novel architecture, Mogo (Motion Only Generate Once), which generates high-quality lifelike 3D human motions by training a single transformer model. Mogo consists of only two main components: 1) RVQ-VAE, a hierarchical residual vector quantization variational autoencoder, which discretizes…
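
To make the residual vector quantization idea behind an RVQ-VAE more concrete, here is a minimal pure-Python sketch of the general technique. It is not the paper's implementation: the function names, the toy codebooks, and the dimensions are all hypothetical, and a real model would learn its codebooks and apply them to encoder latents. Each layer quantizes whatever the previous layers failed to capture, so one latent vector becomes one token per layer.

```python
def nearest_code(vector, codebook):
    """Return (index, code) of the entry closest to `vector` in squared L2 distance."""
    best_index = min(
        range(len(codebook)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, codebook[i])),
    )
    return best_index, codebook[best_index]


def rvq_encode(vector, codebooks):
    """Quantize `vector` layer by layer; each layer encodes the previous residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        index, code = nearest_code(residual, codebook)
        indices.append(index)
        # Subtract the chosen code so the next layer sees only the remaining error.
        residual = [r - c for r, c in zip(residual, code)]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the selected code from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for index, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


# Hypothetical toy codebooks: 2 layers, 2 codes each, 2-dimensional latents.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # coarse layer
    [[0.0, 0.0], [0.5, 0.0]],   # finer layer, applied to the residual
]
tokens = rvq_encode([1.4, 1.0], toy_codebooks)
reconstruction = rvq_decode(tokens, toy_codebooks)
print(tokens, reconstruction)
```

In this toy setup the first layer picks the coarse code nearest to the input and the second layer picks the code nearest to what remains, so the decoded sum lands closer to the input than the first layer alone would.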

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Human Motion and Animation