MOGO: Residual Quantized Hierarchical Causal Transformer for High-Quality and Real-Time 3D Human Motion Generation

Dongjie Fu; Tengjiao Sun; Pengcheng Fang; Xiaohao Cai; Hansung Kim

arXiv:2506.05952·cs.CV·May 5, 2026

TL;DR

MOGO is a novel autoregressive framework for real-time 3D human motion generation that combines hierarchical quantization and causal transformers to produce high-quality, responsive motion from text prompts.

Contribution

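It introduces two modules: MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences, and RQHC-Transformer, a residual quantized hierarchical causal transformer that generates multi-layer motion tokens in a single forward pass. Together they enable real-time, high-fidelity motion synthesis from text.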

Findings

1. MOGO achieves state-of-the-art quality on benchmark datasets.
2. It significantly reduces inference latency for real-time applications.
3. The model generalizes well in zero-shot motion generation scenarios.

Abstract

Recent advances in transformer-based text-to-motion generation have led to impressive progress in synthesizing high-quality human motion. Nevertheless, jointly achieving high fidelity, streaming capability, real-time responsiveness, and scalability remains a fundamental challenge. In this paper, we propose MOGO (Motion Generation with One-pass), a novel autoregressive framework tailored for efficient and real-time 3D motion generation. MOGO comprises two key components: (1) MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences with learnable scaling to produce compact yet expressive representations; and (2) RQHC-Transformer, a residual quantized hierarchical causal transformer that generates multi-layer motion tokens in a single forward pass, significantly reducing inference latency. To enhance semantic fidelity, we further…
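
The hierarchical discretization that the abstract attributes to MoSA-VQ builds on residual vector quantization, where each layer quantizes whatever the previous layers left unexplained. The following is a minimal sketch of that general idea, not the paper's implementation: the function names, the toy codebooks, and the fixed per-layer `scales` (a stand-in for the learnable scaling the abstract mentions) are all assumptions made for illustration.

```python
import random


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vec, code))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def residual_quantize(frame, codebooks, scales):
    """Quantize one feature vector with a stack of codebooks.

    Each layer quantizes the residual left by the earlier layers.
    `scales` holds one hypothetical scalar per layer; it stands in for a
    learned scale and is not the paper's parameterization.
    Returns (one token index per layer, reconstructed vector).
    """
    residual = list(frame)
    recon = [0.0] * len(frame)
    tokens = []
    for codebook, scale in zip(codebooks, scales):
        # Bring the residual into the codebook's range before the lookup.
        scaled = [r / scale for r in residual]
        k = nearest_code(scaled, codebook)
        # Map the chosen code back to the residual's own scale.
        quantized = [c * scale for c in codebook[k]]
        tokens.append(k)
        recon = [a + q for a, q in zip(recon, quantized)]
        residual = [r - q for r, q in zip(residual, quantized)]
    return tokens, recon


# Toy setup: 3 layers, 8 codes per layer, 4-dimensional features.
# Each codebook includes the zero vector, so adding a layer can never
# increase the reconstruction error.
rng = random.Random(0)
dim, num_codes, num_layers = 4, 8, 3
codebooks = [
    [[0.0] * dim]
    + [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_codes - 1)]
    for _ in range(num_layers)
]
scales = [1.0, 0.5, 0.25]  # assumed coarse-to-fine scales
frame = [rng.gauss(0, 1) for _ in range(dim)]

tokens, recon = residual_quantize(frame, codebooks, scales)
print("tokens per layer:", tokens)
print("reconstruction:", recon)
```

In this sketch a motion frame becomes one token index per layer, so a sequence of frames yields the kind of multi-layer token grid that a model such as the RQHC-Transformer would then have to predict. A trained system would learn the codebooks and scales jointly with an encoder and decoder rather than sampling them at random.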
