MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding
Yuan Wang, Di Huang, Yaqi Zhang, Wanli Ouyang, Jile Jiao, Xuetao Feng, Yan Zhou, Pengfei Wan, Shixiang Tang, Dan Xu

TL;DR
MotionGPT-2 is a versatile large language model that generates and understands human motions from text and multimodal inputs, overcoming previous limitations in control and task specificity.
Contribution
It introduces a unified framework that integrates multimodal control and motion generation using discretized tokens and a novel Part-Aware VQVAE for detailed motion representation.
Findings
Effective across motion generation, captioning, and completion tasks.
Outperforms existing methods in generating lifelike, controllable human motions.
Demonstrates high adaptability and fine-grained motion control.
Abstract
Generating lifelike human motions from descriptive texts has attracted considerable research attention in recent years, driven by the emerging requirements of digital humans. Despite impressive advances, existing approaches are often constrained by limited control modalities, task specificity, and a sole focus on body motion representations. In this paper, we present MotionGPT-2, a unified Large Motion-Language Model (LMLM) that addresses these limitations. MotionGPT-2 accommodates multiple motion-relevant tasks and supports multimodal control conditions through pre-trained Large Language Models (LLMs). It quantizes multimodal inputs, such as text and single-frame poses, into discrete, LLM-interpretable tokens, seamlessly integrating them into the LLM's vocabulary. These tokens are then organized into unified prompts, guiding the LLM to generate motion outputs through a…
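The quantize-then-prompt idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the toy codebook, the `<motion_k>` token format, the prompt layout, and every function name below are assumptions chosen for illustration. It shows only the general vector-quantization step (nearest codebook entry per pose frame) and how the resulting indices could be rendered as vocabulary-style tokens inside a text prompt.

```python
from typing import List, Sequence


def nearest_code(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def quantize(frames: Sequence[Sequence[float]], codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each pose feature vector to a discrete codebook index."""
    return [nearest_code(frame, codebook) for frame in frames]


def to_tokens(indices: Sequence[int], prefix: str = "motion") -> List[str]:
    """Render indices as token strings; the <motion_k> format is hypothetical."""
    return [f"<{prefix}_{i}>" for i in indices]


def build_prompt(instruction: str, pose_tokens: Sequence[str]) -> str:
    """Combine a text instruction and pose tokens into one prompt (layout is hypothetical)."""
    return f"Instruction: {instruction}\nInitial pose: {''.join(pose_tokens)}\nMotion:"


# Toy 2-D codebook and pose features; real motion features are far higher-dimensional.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

indices = quantize(frames, codebook)
tokens = to_tokens(indices)
prompt = build_prompt("a person walks forward", tokens[:1])
print(indices)
print(prompt)
```

In a part-aware variant, one could imagine keeping separate codebooks (for example, one for the body and one for the hands) and giving each its own token prefix; that extension is likewise only a guess at the general shape, not a description of the paper's Part-Aware VQVAE.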
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Speech and dialogue systems
