M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation
Mingshuang Luo, Ruibing Hou, Zhuo Li, Hong Chang, Zimo Liu, Yaowei Wang, Shiguang Shan

TL;DR
M$^3$GPT is a novel multimodal, multitask framework that enables comprehensive motion understanding and generation by integrating multiple signals and modeling their interactions within a unified system.
Contribution
It introduces a unified multimodal representation space, direct raw motion modeling, and task synergy learning, pioneering motion comprehension and generation from diverse signals.
Findings
Superior performance on motion tasks
Effective zero-shot generalization
Seamless multimodal integration
Abstract
This paper presents M$^3$GPT, an advanced Multimodal, Multitask framework for Motion comprehension and generation. M$^3$GPT operates on three fundamental principles. The first focuses on creating a unified representation space for various motion-relevant modalities. We employ discrete vector quantization for multimodal conditional signals, such as text, music and motion/dance, enabling seamless integration into a large language model (LLM) with a single vocabulary. The second involves modeling motion generation directly in the raw motion space. This strategy circumvents the information loss associated with a discrete tokenizer, resulting in more detailed and comprehensive motion generation. Third, M$^3$GPT learns to model the connections and synergies among various motion-relevant tasks. Text, the most familiar and well-understood modality for LLMs, is…
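The first principle, discrete vector quantization feeding a single vocabulary, can be illustrated with a minimal sketch. This is not the authors' implementation: the codebooks, feature vectors, vocabulary size, and the function names `quantize` and `to_shared_ids` are all hypothetical, and in practice a codebook would be learned rather than written by hand. The sketch only shows nearest-neighbour quantization and how per-modality code indices might be offset so that text, motion, and music tokens occupy disjoint ranges of one vocabulary.

```python
# Illustrative sketch only: every codebook, size, and name here is a
# hypothetical stand-in, not a value or API taken from the paper.

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [
        min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
        for f in features
    ]

def to_shared_ids(indices, offset):
    """Shift modality-local code indices into their slice of a shared vocabulary."""
    return [offset + i for i in indices]

TEXT_VOCAB_SIZE = 100  # hypothetical size of the text vocabulary

# Toy 2-D codebooks; a real codebook would be learned and far larger.
motion_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
music_codebook = [[0.0, 1.0], [1.0, 0.0]]

# Each modality gets a disjoint ID range placed after the text tokens.
MOTION_OFFSET = TEXT_VOCAB_SIZE
MUSIC_OFFSET = MOTION_OFFSET + len(motion_codebook)

motion_feats = [[0.1, -0.1], [1.9, 0.2], [0.9, 1.2]]
music_feats = [[0.9, 0.1], [0.1, 0.8]]

motion_ids = to_shared_ids(quantize(motion_feats, motion_codebook), MOTION_OFFSET)
music_ids = to_shared_ids(quantize(music_feats, music_codebook), MUSIC_OFFSET)

print(motion_ids)  # [100, 102, 101]
print(music_ids)   # [104, 103]
```

Because the ID ranges do not overlap, a single embedding table can in principle serve all three modalities. The second principle described in the abstract, generating in the raw motion space rather than over discrete tokens, is not covered by this sketch.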
Taxonomy
Topics: Video Analysis and Summarization · Human Pose and Action Recognition · Multimodal Machine Learning Applications
