T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations

Jianrong Zhang; Yangsong Zhang; Xiaodong Cun; Shaoli Huang; Yong Zhang; Hongwei Zhao; Hongtao Lu; Xi Shen

arXiv:2301.06052 · cs.CV · September 26, 2023 · 20 cites

Open Access · 1 Repo · 1 Model

TL;DR

This paper introduces T2M-GPT, a simple yet effective framework combining VQ-VAE and GPT for generating human motion from text descriptions, outperforming recent diffusion models on key metrics.

Contribution

Proposes a straightforward VQ-VAE and GPT-based method for text-to-human motion generation, demonstrating competitive performance and highlighting VQ-VAE's continued relevance.
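
As a rough illustration of what a discrete representation means in a VQ-VAE, the sketch below maps each continuous latent vector to the index of its nearest codebook entry. It is a minimal sketch, not the authors' implementation: the function names (`squared_distance`, `quantize`), the 2-dimensional toy codebook, and the latent values are all hypothetical, and a real model would learn the codebook and use a tensor library.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each latent vector, the index of the nearest codebook entry."""
    indices = []
    for z in latents:
        distances = [squared_distance(z, c) for c in codebook]
        indices.append(distances.index(min(distances)))
    return indices


# Toy codebook and latents (hypothetical values, for illustration only).
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
latents = [(0.1, -0.2), (0.9, 1.2), (-0.8, 1.7)]

tokens = quantize(latents, codebook)
print(tokens)  # one discrete token index per latent vector
```

The resulting index sequence is what a GPT-style model would then predict token by token, conditioned on the text.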

Findings

01 Achieves high-quality discrete motion representations.

02 Outperforms diffusion-based models on FID (0.116 versus 0.630 for MotionDiffuse on HumanML3D).

03 Identifies dataset size as a limiting factor.

Abstract

In this work, we investigate a simple and well-known conditional generative framework based on Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a simple corruption strategy during training to alleviate the training-testing discrepancy. Despite its simplicity, our T2M-GPT shows better performance than competitive approaches, including recent diffusion-based approaches. For example, on HumanML3D, which is currently the largest dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), but with an FID of 0.116, largely outperforming MotionDiffuse's 0.630. Additionally,…
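
The abstract does not spell out the corruption strategy. One plausible form, assumed here purely for illustration, is to replace a random fraction of the ground-truth motion tokens with random codebook indices before they are fed to the GPT during training, so that the model sees imperfect context similar to its own predictions at test time. The function name `corrupt_tokens`, the `corruption_rate` parameter, and the codebook size of 512 are hypothetical.

```python
import random
from typing import List, Sequence


def corrupt_tokens(tokens: Sequence[int],
                   codebook_size: int,
                   corruption_rate: float,
                   rng: random.Random) -> List[int]:
    """Replace each token with a random codebook index with probability corruption_rate."""
    corrupted = []
    for token in tokens:
        if rng.random() < corruption_rate:
            # Assumed corruption rule: draw a uniformly random index.
            corrupted.append(rng.randrange(codebook_size))
        else:
            corrupted.append(token)
    return corrupted


rng = random.Random(0)  # fixed seed so the sketch is reproducible
ground_truth = [12, 7, 300, 45, 45, 9]  # hypothetical motion token sequence

unchanged = corrupt_tokens(ground_truth, codebook_size=512, corruption_rate=0.0, rng=rng)
noisy = corrupt_tokens(ground_truth, codebook_size=512, corruption_rate=0.5, rng=rng)
fully_random = corrupt_tokens(ground_truth, codebook_size=512, corruption_rate=1.0, rng=rng)
```

In such a setup only the input sequence would be corrupted; the prediction targets would remain the clean ground-truth tokens.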

Code & Models

Repositories

Mael-zys/T2M-GPT
pytorch · Official

Models

🤗 vumichien/T2M-GPT
model · ♡ 16

Taxonomy

Topics: Human Pose and Action Recognition · Human Motion and Animation · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Dropout · Softmax · Adam · Cosine Annealing · Discriminative Fine-Tuning · Linear Warmup With Cosine Annealing