MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding

Yuan Wang; Di Huang; Yaqi Zhang; Wanli Ouyang; Jile Jiao; Xuetao Feng; Yan Zhou; Pengfei Wan; Shixiang Tang; Dan Xu

arXiv:2410.21747 · cs.CV · October 30, 2024

TL;DR

MotionGPT-2 is a versatile large language model that generates and understands human motions from text and multimodal inputs, overcoming previous limitations in control and task specificity.

Contribution

It introduces a unified framework that integrates multimodal control and motion generation using discretized tokens and a novel Part-Aware VQVAE for detailed motion representation.

Findings

01

Effective across motion generation, captioning, and completion tasks.

02

Outperforms existing methods in generating lifelike, controllable human motions.

03

Demonstrates high adaptability and fine-grained motion control.

Abstract

Generating lifelike human motions from descriptive text has attracted considerable research attention in recent years, driven by the emerging requirements of digital humans. Despite impressive advances, existing approaches are often constrained by limited control modalities, task specificity, and a sole focus on body motion representations. In this paper, we present MotionGPT-2, a unified Large Motion-Language Model (LMLM) that addresses these limitations. MotionGPT-2 accommodates multiple motion-relevant tasks and supports multimodal control conditions through pre-trained Large Language Models (LLMs). It quantizes multimodal inputs, such as text and single-frame poses, into discrete, LLM-interpretable tokens and integrates them into the LLM's vocabulary. These tokens are then organized into unified prompts, guiding the LLM to generate motion outputs through a…
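
The quantize-then-prompt idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-neighbour codebook lookup is the generic vector-quantization step, and the token format `<motion_k>`, the prompt template, and all function names (`quantize`, `to_motion_tokens`, `extend_vocab`, `build_prompt`) are hypothetical choices made only for illustration.

```python
from typing import Dict, List, Sequence


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance). Generic VQ lookup, not the paper's encoder."""
    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
        for f in features
    ]


def to_motion_tokens(indices: Sequence[int]) -> List[str]:
    """Render code indices as token strings (hypothetical format)."""
    return [f"<motion_{i}>" for i in indices]


def extend_vocab(vocab: Dict[str, int], codebook_size: int) -> Dict[str, int]:
    """Return a copy of the text vocabulary with one new token per code,
    assigned ids after the existing ones."""
    extended = dict(vocab)
    for k in range(codebook_size):
        extended[f"<motion_{k}>"] = len(extended)
    return extended


def build_prompt(instruction: str, text: str, pose_tokens: Sequence[str]) -> str:
    """Assemble a single prompt from a task instruction, a text condition,
    and quantized pose tokens (hypothetical template)."""
    return (
        f"Instruction: {instruction}\n"
        f"Text: {text}\n"
        f"Pose: {''.join(pose_tokens)}\n"
        f"Motion:"
    )


# Toy data: a 4-entry codebook and three 2-D pose features.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = [[0.1, -0.1], [0.9, 0.2], [0.8, 0.9]]

indices = quantize(features, codebook)
tokens = to_motion_tokens(indices)
vocab = extend_vocab({"walk": 0, "forward": 1}, len(codebook))
prompt = build_prompt("Generate a motion.", "a person walks forward", tokens)
print(indices)
print(prompt)
```

In this toy setup the text and pose conditions end up in one string over a shared vocabulary, which is the property the abstract attributes to the unified prompts; a real system would learn the codebook and use the LLM's own tokenizer rather than a plain dictionary.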

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Speech and dialogue systems
