T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations
Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, Xi Shen

TL;DR
This paper introduces T2M-GPT, a simple yet effective framework that combines VQ-VAE and GPT to generate human motion from text descriptions. On HumanML3D it reaches an FID of 0.116 versus 0.630 for the diffusion-based MotionDiffuse, with comparable R-Precision.
Contribution
Proposes a straightforward VQ-VAE and GPT-based method for text-to-human motion generation, demonstrating competitive performance and highlighting VQ-VAE's continued relevance.
Findings
Achieves high-quality discrete motion representations.
Outperforms diffusion-based models on FID score.
Identifies dataset size as a limiting factor.
Abstract
In this work, we investigate a simple and well-known conditional generative framework based on Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a simple corruption strategy during training to alleviate the training-testing discrepancy. Despite its simplicity, our T2M-GPT shows better performance than competitive approaches, including recent diffusion-based approaches. For example, on HumanML3D, which is currently the largest dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), but with an FID of 0.116, largely outperforming MotionDiffuse's 0.630. Additionally,…
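The two ingredients named in the abstract, discrete codes from a VQ-VAE and a corruption strategy for GPT training, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes nearest-neighbour codebook lookup for quantization, and it assumes corruption means replacing each ground-truth code index with a random one with probability `tau`. The names `quantize`, `corrupt_tokens`, and `tau` are hypothetical, and the EMA and Code Reset recipes are not shown.

```python
import random


def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    Assumption: squared Euclidean distance, as in a standard VQ-VAE lookup.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]


def corrupt_tokens(tokens, codebook_size, tau, rng):
    """Replace each token with a uniformly random code index with probability tau.

    Assumption: this is one plausible form of a "simple corruption strategy";
    the random replacement may by chance equal the original token.
    """
    return [
        rng.randrange(codebook_size) if rng.random() < tau else t
        for t in tokens
    ]


# Toy example: three 2-D codebook entries and three encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3]]

tokens = quantize(latents, codebook)
print(tokens)  # [0, 1, 2]

# During training, the GPT would be fed the corrupted sequence as input
# while still being asked to predict the clean tokens.
rng = random.Random(0)
noisy = corrupt_tokens(tokens, codebook_size=len(codebook), tau=0.5, rng=rng)
print(noisy)
```

The idea behind corrupting the input is that, at inference time, the model conditions on its own possibly wrong predictions rather than on ground-truth tokens, so exposing it to imperfect prefixes during training narrows that gap.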
Taxonomy
Topics: Human Pose and Action Recognition · Human Motion and Animation · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Dropout · Softmax · Adam · Cosine Annealing · Discriminative Fine-Tuning · Linear Warmup With Cosine Annealing
